Circular RNA in Chemonaive Lymph Node Negative Colon Cancer Patients

Simple Summary Colon cancer (CC) is one of the most common types of cancer. Circular RNAs (circRNAs) appear to play an important role in tumor progression of CC. They are stably expressed in saliva, blood, and exosomes, potentially rendering them promising biomarkers for the diagnosis, prognosis, and treatment of CC. In this study we describe the identification of an extensive catalog of circRNAs in a large cohort of 181 chemonaive, stage I/II primary colon tumors and related circRNA expression to consensus molecular subtypes (CMS), microsatellite instability (MSI) status and clinical outcome. We observed that a high diversity in circRNAs was associated with favorable disease-free survival, and that several circRNAs were associated with MSI and CMS, demonstrating the potential clinical value of circRNAs in CC. Abstract Circular RNAs (circRNAs) appear important in tumor progression of colon cancer (CC). We identified an extensive catalog of circRNAs in 181 chemonaive stage I/II colon tumors, who underwent curative surgery between 2007 and 2014. We identified circRNAs from RNAseq data, investigated common biology related to circRNA expression, and studied the association between circRNAs and relapse status, tumor stage, consensus molecular subtypes (CMS), tumor localization and microsatellite instability (MSI). We identified 2606 unique circRNAs. 277 circRNAs (derived from 260 genes) were repeatedly occurring in at least 20 patients of which 153 showed a poor or even negative (R < 0.3) correlation with the expression level of their linear gene. The circular junctions for circSATB2, circFGD6, circKMT2C and circPLEKHM3 were validated by Sanger sequencing. Multiple correspondence analysis showed that circRNAs were often co-expressed and that high diversity in circRNAs was associated with favorable disease-free survival (DFS), which was confirmed by Cox regression analysis (Hazard Ratio (HR) 0.60, 95% CI 0.38–0.97, p = 0.036). Considering individual circRNAs, absence of circMGA was significantly associated with relapse, whereas circSATB2, circNAB1, and circCEP192 were associated with both MSI and CMS. This study represents a showcase of the potential clinical utility of circRNAs for prognostic stratification in patients with stage I–II colon cancer and demonstrated that high diversity in circRNAs is associated with favorable DFS.


Introduction
Colon cancer is one of the most common types of cancer with over 1 million new cases worldwide and around 9800 new cases in the Netherlands in 2018 [1]. Up to 21% 2.2.1. Microsatellite Instability MSI analyses have previously been performed and described [32]. In short, the MSI analyses made use of the MSI Analysis System from Promega©, which is a fluorescent PCR-based assay for detection of microsatellite instability in seven markers, including five mononucleotide repeat markers (BAT-25, BAT-26, NR-21, NR-24 and MONO-27) and two pentanucleotide repeat markers (Penta C and Penta D) [34].

Identification of circRNAs
The methodology used to identify circRNA reads has previously been described in detail [36]. In short, the method developed by Smid et al. uses sequence reads that have a "secondary alignment" (SA) tag. When using paired-end sequence data, and assuming a circRNA molecule is present, the sequence read that aligns over the crossing of the junction would "point toward" its read-mate somewhere in the circle. Aligning these reads to the linear reference, the junction read will get an SA tag which will be assigned to two locations if and only if this is the one and unique alignment configuration the STAR software can find [37,38]. The read-mate aligns somewhere in between these two locations. Finding additional read pairs showing this configuration with a breakpoint at the exact same location strengthens the evidence for circular transcripts. We included only regions with at least five reads crossing the circular junction. After filtering, GENCODE annotation was used to obtain the exon locations of genes that exactly matched to the circular region. For each sample, STAR also gives the raw read counts for all genes. These were normalized (Trimmed Mean of M-values implemented in edgeR [39]), and the normalized read counts were used to correlate with the number of junction reads of the circular transcripts. The script is also available at https://bitbucket.org/snippets/MSmid/ Le949d/identify-circularrna-reads (accessed on 30 October 2018).

Multiple Correspondence Analysis (MCA)
For a substantial number of genes, only a linear transcript is detected in the majority of samples, which results in many missing values per circRNA. This in turn, complicates the use of standard cluster analysis for the identification of sample groups with similar circRNA-related biology. Therefore, circRNA data were considered categorical, i.e., a circRNA was scored as either "present" or "absent" in a sample. These categorical data are suitable for a multiple correspondence analysis (MCA), which is a generalized principle component analysis. An MCA generates a combined plot that shows both patients and circRNAs in such a way that patients and circRNAs that have similar patterns are closer together. Thus, the colon cancer tumor samples and circRNAs are projected onto the same plane, in which the relative distance to either the samples or the circRNAs is meaningful. The 0,0 point corresponds to a sample or circRNA with an average profile. The R-package "ade4" was used to perform the MCA in R version 3.4.1. Custom functions to plot the MCA results are available upon request of the authors.

Reverse Transcription, Quantitative PCR, and Sanger Sequencing
Candidate circular RNAs were selected, and divergent primers were designed that are only able to amplify and detect the circular and not the corresponding linear mRNA (Table S1) using Phusion high-fidelity DNA polymerase (ThermoFisher Scientific) in a total reaction volume of 25 µL, containing 400 nM of each primer and 160 µM dNTPs. For every circRNA two resulting PCR amplicons were purified from gel using the Qiaquick Gel Extraction Kit from Qiagen (Hilden, Germany) according to the manufacturer's protocol and subjected to Quick Shot Sanger Sequencing by BaseClear BV (Leiden, The Netherlands).

Statistical Analyses
As indicated above for a substantial number of genes, only a linear transcript was detected in the majority of samples, which results in many missing values per circRNA. This hampers statistical analyses of circRNA expression levels and therefore we categorized the circRNA data into "present" or "absent" for statistical evaluation as well. We used circRNAs present in at least 20 samples to ensure a sufficient number of events for subsequent statistical analyses. STATA version 14 and SPSS Version 24.0 (SPSS, Inc., Chicago, IL, USA) were used to perform the statistical tests that are also indicated in the text. Cox's proportional-hazards regression was used to evaluate the (log-transformed) number of uniquely present circRNAs per sample, hereafter called circRNA diversity, with DFS, or as "present"/"absent" when evaluating individual circRNAs. Survival curves were evaluated using the logrank test (for individual circRNAs) or with the logrank test for trend (for circRNA diversity, after dividing into three equal quantiles). Pearson's correlation was used to correlate the circRNA expression with the expression of the linear gene it was derived from. Analyses between categorical variables (like present/absent of a circRNA versus MSI yes/no) were analyzed using Fisher's exact test. Reported p-values are two-sided and considered significant at p ≤ 0.05. p-values were corrected for multiple testing using Benjamini-Hochberg's FDR correction when evaluating multiple circRNAs, which were considered significant at p < 0.10.

CircRNA Expression in Colon Cancer
We analyzed RNAseq data of 181 patients with chemonaive, stage I/II primary colon cancer. The median follow-up time was 53 months (IQR 37-59). Clinical and histopathological characteristics are listed in Table 1. Circular RNAs were defined as present when at least five reads crossed the circular junction [36]. This resulted in the identification of 2606 distinct circRNAs in the entire cohort, of which 1860 were derived from known genes. Sixty-three percent of these were repeatedly occurring in at least two colon cancer samples (n = 1172) (Figure 1), whereas 277 (15%) were observed in 20 samples or more (Table S2). The most repeatedly occurring circRNAs were derived from SMARCA5, HIPK3, ZKSCAN1 and FBXW7 and were observed in 177 samples each (n.b. not the same 177 samples for all four circRNAs) ( Table 2). For 29 genes we observed that more than one unique circRNA was derived from the linear sequence (Table S2). Relative to the total number of samples expressing at least one circRNA from the respective gene, varying levels of co-expression between circRNAs derived from the same gene were observed with a median of 26.19% (range: 4.40-82.30%). Circular RNAs were defined as present when at least five reads crossed the circular junction [36]. This resulted in the identification of 2606 distinct circRNAs in the entire cohort, of which 1860 were derived from known genes. Sixty-three percent of these were repeatedly occurring in at least two colon cancer samples (n = 1172) (Figure 1), whereas 277 (15%) were observed in 20 samples or more (Table S2). The most repeatedly occurring circRNAs were derived from SMARCA5, HIPK3, ZKSCAN1 and FBXW7 and were observed in 177 samples each (n.b. not the same 177 samples for all four circRNAs) ( Table  2). For 29 genes we observed that more than one unique circRNA was derived from the linear sequence (Table S2). Relative to the total number of samples expressing at least one circRNA from the respective gene, varying levels of co-expression between circRNAs derived from the same gene were observed with a median of 26.19% (range: 4.40-82.30%).     We correlated the number of circRNA reads per circRNA with the expression of the linear gene from which the circRNA was derived. To avoid possible spurious correlations, only the 277 circRNAs found in at least 20 samples were analyzed. The vast majority of circRNAs showed a positive correlation with the linear gene from which the circRNA is derived ( Figure 2). However, this correlation was poor (R < 0.3) for 126 circRNAs. Twenty-seven circRNAs showed a negative correlation with their corresponding linear gene, suggesting the circRNA may function independently from the linear transcript.
We correlated the number of circRNA reads per circRNA with the expression of the linear gene from which the circRNA was derived. To avoid possible spurious correlations, only the 277 circRNAs found in at least 20 samples were analyzed. The vast majority of circRNAs showed a positive correlation with the linear gene from which the circRNA is derived ( Figure 2). However, this correlation was poor (R < 0.3) for 126 circRNAs. Twentyseven circRNAs showed a negative correlation with their corresponding linear gene, suggesting the circRNA may function independently from the linear transcript.  [40]). The identified junctions in the RNAseq data were verified for all four selected circRNAs by Sanger sequencing, thereby demonstrating the validity of our circRNA identification algorithm [36] (Figure 3).

CircRNA Expression Patterns Are Associated with Relevant Clinical Factors
For a substantial number of genes, only a linear transcript is detected in the majority of samples, which results in many missing values per circRNA. This in turn complicates the use of standard cluster analysis for the identification of circRNA/sample groups with similar circRNA-related biology. To be able to investigate circRNA profiles, we categorized circRNAs as "present" or "absent" in a sample and used this in a multiple correspondence analysis (MCA) to find naturally occurring subgroups. An MCA plot projects the colon tumor samples and circRNAs onto the same plane, in which the relative distance to either the samples or the circRNAs is meaningful. As such, samples that group close together have more similar circRNA profiles. In addition, since circRNAs have two states (present/absent), both these states are used in the analysis. Thus, two circRNAs that are "present" frequently in the same samples (co-occurrence) will be placed at a short distance, but this is also true for circRNAs that are mutually exclusive (presence of a  [40]). The identified junctions in the RNAseq data were verified for all four selected circRNAs by Sanger sequencing, thereby demonstrating the validity of our circRNA identification algorithm [36] (Figure 3).

CircRNA Expression Patterns Are Associated with Relevant Clinical Factors
For a substantial number of genes, only a linear transcript is detected in the majority of samples, which results in many missing values per circRNA. This in turn complicates the use of standard cluster analysis for the identification of circRNA/sample groups with similar circRNA-related biology. To be able to investigate circRNA profiles, we categorized circRNAs as "present" or "absent" in a sample and used this in a multiple correspondence analysis (MCA) to find naturally occurring subgroups. An MCA plot projects the colon tumor samples and circRNAs onto the same plane, in which the relative distance to either the samples or the circRNAs is meaningful. As such, samples that group close together have more similar circRNA profiles. In addition, since circRNAs have two states (present/absent), both these states are used in the analysis. Thus, two circRNAs that are "present" frequently in the same samples (co-occurrence) will be placed at a short distance, but this is also true for circRNAs that are mutually exclusive (presence of a circRNA and absence of the other circRNA) across the samples. Coloring the circRNA states will reveal the co-occurrence/mutual-exclusivity.
For the MCA analysis, we used the 277 repeatedly occurring circRNAs (i.e., those which are present in at least 20 samples) and labelled these in each sample as "present" or "absent" as defined above. After MCA analysis, we first colored genes based on the circRNA state ( Figure 4a). As shown by the clear separation of the "present" and "absent" state, circRNAs do not show mutual-exclusivity (which would show up as a red triangle among blue triangles or vice versa), but rather are often co-expressed in the same samples. circRNA and absence of the other circRNA) across the samples. Coloring the circRNA states will reveal the co-occurrence/mutual-exclusivity. For the MCA analysis, we used the 277 repeatedly occurring circRNAs (i.e., those which are present in at least 20 samples) and labelled these in each sample as "present" or "absent" as defined above. After MCA analysis, we first colored genes based on the circRNA state ( Figure 4a). As shown by the clear separation of the "present" and "absent" state, circRNAs do not show mutual-exclusivity (which would show up as a red triangle among blue triangles or vice versa), but rather are often co-expressed in the same samples.  (Figure 4b) have profiles that are close to the "absent" group (group without circRNAs), which indicates that few genes give rise to circRNA expression in these  (Figure 4b) have profiles that are close to the "absent" group (group without circRNAs), which indicates that few genes give rise to circRNA expression in these samples. Indeed, when analyzing circRNA diversity (the number of distinct circRNA molecules in a sample) we found that a high diversity in circRNAs is associated with a favorable DFS: Cox regression using circRNA diversity as (log-transformed) continuous variable: Hazard Ratio (HR) 0.60, 95% CI 0.38-0.97, p = 0.036. Figure 5 shows Kaplan-Meier curves in which the levels of diversity of circRNAs were split into three equal quantiles to visualize the association between circRNA-diversity and DFS. The difference in DFS between the three quantiles was evaluated using the logrank test for trend, to account for the ordered structure of the sample groups (high, intermediate and low circRNA diversity; p = 0.050).
High diversity was not associated with other clinical factors such as tumor stage, tumor side, MSI status, or CMS (diversity as continuous variable).
samples. Indeed, when analyzing circRNA diversity (the number of distinct circRNA molecules in a sample) we found that a high diversity in circRNAs is associated with a favorable DFS: Cox regression using circRNA diversity as (log-transformed) continuous variable: Hazard Ratio (HR) 0.60, 95% CI 0.38-0.97, p = 0.036. Figure 5 shows Kaplan-Meier curves in which the levels of diversity of circRNAs were split into three equal quantiles to visualize the association between circRNA-diversity and DFS. The difference in DFS between the three quantiles was evaluated using the logrank test for trend, to account for the ordered structure of the sample groups (high, intermediate and low circRNA diversity; p = 0.050). High diversity was not associated with other clinical factors such as tumor stage, tumor side, MSI status, or CMS (diversity as continuous variable).  The MCA plot of tumor side (Figure 4e) shows that right-sided tumors are closer to the absent group (group of samples without circRNAs) than the left-sided tumors -therefore also closer to the relapse group, but this association was not significant. Combining CMS grouping ( Figure 4d) and MSI (Figure 4f) shows that, as expected, samples that are CMS1 and those that are MSI tumors have a similar position. Combining CMS (Figure 4d) and circRNA diversity (Figure 4a) leads to the conclusion that CMS3 patients have the highest diversity in circRNAs, and CMS2 patients the lowest. As to tumor stage, there was no clear distinction between stage I and II tumors with regard to circRNAs (Figure 4e).
Combining CMS grouping (Figure 4d) and MSI (Figure 4f) shows that, as expected, samples that are CMS1 and those that are MSI tumors have a similar position. Combining CMS (Figure 4d) and circRNA diversity (Figure 4a) leads to the conclusion that CMS3 patients have the highest diversity in circRNAs, and CMS2 patients the lowest. As to tumor stage, there was no clear distinction between stage I and II tumors with regard to circRNAs (Figure 4e). Next to this global analysis of overall circRNA profiles in the samples, we also investigated whether the presence/absence of specific circRNAs was associated with relapse status, tumor stage, CMS, tumor localization, and MSI. Whereas no specific circRNAs were significantly associated with tumor stage and localization, the presence/absence of nine circRNAs was associated with CMS and five with MSI (Table 3; Fisher exact test p < 0.0003; Benjamini-Hochberg corrected p-value < 0.1). circSATB2 (Special AT-rich sequence-binding protein 2), circNAP1 (Nucleosome assembly protein) and circCEP192 (Centrosomal Protein 192) each correlated with both MSI and CMS. Only absence of circMGA (MAX dimerization protein) was significantly associated with relapse (Table 3; Fisher exact test p = 0.0002; Benjamini-Hochberg corrected p-value = 0.06). Kaplan-Meier analysis showed that patients in whom circMGA was detected (n = 94) had a favorable DFS compared to patients in which circMGA was not detected (n = 87; logrank test p < 0.001, Cox HR 0.22 95%CI 0.09-0.53, p < 0.001) ( Figure 6). Next to this global analysis of overall circRNA profiles in the samples, we also investigated whether the presence/absence of specific circRNAs was associated with relapse status, tumor stage, CMS, tumor localization, and MSI. Whereas no specific circRNAs were significantly associated with tumor stage and localization, the presence/absence of nine circRNAs was associated with CMS and five with MSI (Table 3; Fisher exact test p < 0.0003; Benjamini-Hochberg corrected p-value < 0.1). circSATB2 (Special AT-rich sequence-binding protein 2), circNAP1 (Nucleosome assembly protein) and circCEP192 (Centrosomal Protein 192) each correlated with both MSI and CMS. Only absence of circMGA (MAX dimerization protein) was significantly associated with relapse (Table 3; Fisher exact test p = 0.0002; Benjamini-Hochberg corrected p-value = 0.06). Kaplan-Meier analysis showed that patients in whom circMGA was detected (n = 94) had a favorable DFS compared to patients in which circMGA was not detected (n = 87; log-rank test p < 0.001, Cox HR 0.22 95%CI 0.09-0.53, p < 0.001) ( Figure 6).

Discussion
With the use of RNAseq data, we could establish the presence of a wide variety of circRNAs in chemonaive lymph node negative, stage I/II primary colon tumors. Previous

Discussion
With the use of RNAseq data, we could establish the presence of a wide variety of circRNAs in chemonaive lymph node negative, stage I/II primary colon tumors. Previous studies have been limited by the small number of circRNAs screened, the small sample size and retrospective data. Our study, however, concerned 181 patients included in a prospective, multicenter cohort study, and is therefore, to our knowledge, the largest circRNA-based biomarker discovery study done in stage I/II colon cancer.
The four most repeatedly occurring circRNAs that we found (177/181 samples), circSMARCA5, circHIPK3, circFBXW7 and circZKSCAN1, have also been described as such in previous studies. circSMARCA5 was reported to be induced during epithelial-tomesenchymal transition, which is an important mechanism during the metastatic process that has been associated with the pathogenesis of several cancers [26,[41][42][43][44]. circHIPK3 has been described to promote CRC growth and metastasis by sponging miR-7 [45]. Furthermore, previous research in CRC cell lines showed that circFBXW7 is conducive in controlling the progression of CRC through NEK2, mTOR, and PTEN signaling pathways [37]. The correspondence of our finding with previous results clearly underlines the validity of our approach in identifying circRNAs. In addition, we performed Sanger sequencing to verify four randomly selected circRNAs (circSATB2, circKMT2C circFGD6, and circPLEKHM3) and successfully validated the identified circular junctions for all four circRNAs.
In the studied cohort of chemonaive lymph node negative colon cancer patients, a first highlight was the finding that high diversity of circRNAs present in colon cancer tissue was associated with favorable DFS. Vo et al showed that across different cancer types, total circRNA abundance was lower in cancer compared to normal tissue, suggesting that the reduction of circRNA generation could be associated with loss of cellular differentiation [46]. More specifically, presence of circMGA was significantly associated with a favorable DFS. Together, these findings support the idea that circRNAs might play a functional role in cancer metastasis [26]. Recent studies provide evidence for a tumor suppressive role for the gene MGA (MAX dimerization protein) in colorectal cancer [47]. In lung adenocarcinoma, the molecular function of MGA appears to be antagonistic to that of MYC. To our knowledge, this is the first study associating the circRNA emanating from this gene with colon cancer or any other malignancies.
A second highlight of this study is the association between circRNAs and distinct colorectal cancer subtypes. Presence/absence of nine and five circRNAs was significantly associated with CMS and MSI, respectively, of which circSATB2, circNAB1, circCEP192 were overlapping. Although we were unable to find a suitable publicly available RNAseq dataset to validate the associations we found between circRNAs and clinical parameters in our cohort of stage I-II colon cancer patients, a number of the circRNAs we found to be associated with distinct subtypes of colon cancer were described before in cancer. circSATB2 has been described to play a notable role in the progression of lung cancer by binding to miR-326 [48], which in turn is associated with CRC [49]. The association between CEP192, NAB1 and CRC or other cancers, has not been described in previous studies. A role in CRC was proven for circZNF609 (Zinc Finger Protein 609), which is down-regulated in CRC tissue and promotes apoptosis in CRC by upregulating p53 [50]. circUBAP2 (ubiquitin associated protein 2) facilitates CRC progression by sponging miR-199a to upregulate VEGFA which implies that circUBAP2 may be a potential therapeutic biomarker for CRC [51]. Furthermore, circZBTB44 (Zinc Finger and BTB Domain Containing 44) and circZNF609 are both upregulated in acute lymphoblastic leukemia [52] of which especially circZNF609 has a known oncogenic potential in multiple other cancers as well [53][54][55][56][57]. CircASPH (Aspartate Beta-Hydroxylase) expression is upregulated in lung adenocarcinoma [58] and, finally, circFUT8 (Fucosyltransferase 8) functions as a tumor suppressor in bladder cancer cells where low circFUT8 was associated with poor prognosis, high histological grade, and lymph node metastasis [59]. The largest strength of this study is its prospective, multicenter study design and that it is, to our knowledge, the largest circRNA-based biomarker discovery study performed in stage I/II colon cancer. However, as mentioned before, a limitation of this study is that we were unable to find a suitable publicly available dataset to validate the associations we found between circRNAs and clinical parameters. Furthermore, some of the subgroup analyses, such as MSI, resulted in rather small sample sizes in outcome, increasing the chance of type II errors.

Conclusions
In conclusion, this study generated a comprehensive catalog of circRNAs in colon cancer and demonstrated the potential biological and clinical relevance of circRNAs in patients with stage I-II colon cancer. We demonstrated high diversity in circRNAs is associated with favorable DFS. As such, circRNAs represent a promising addition to the biomarker repertoire for colon cancer.