Targeted genomic screen reveals focal long non-coding RNA copy number alterations in cancer

The landscape of somatic copy-number alterations (SCNAs) affecting long non-coding RNAs (lncRNAs) in human cancer remains largely unexplored. While the majority of lncRNAs remain to be functionally characterized, several have been implicated in cancer development and metastasis. Considering the plethora of lncRNA genes currently reported, it is conceivable that several lncRNAs function as oncogenes or tumor suppressor genes. We devised a strategy to detect focal lncRNA SCNAs using a custom DNA microarray platform probing 20 418 lncRNA genes. By screening a panel of 80 cancer cell lines, we detected numerous focal aberrations targeting one or multiple lncRNAs without affecting neighboring protein-coding genes. These focal aberrations are highly suggestive of a tumor suppressive or oncogenic role of the targeted lncRNA gene. Although functional validation remains an essential step in the further characterization of the candidate cancer lncRNAs involved, our results provide a direct way of prioritizing candidate lncRNAs involved in cancer pathogenesis.


Introduction
The cancer genome is marked by large numbers of genetic and non-genetic alterations. The great majority of those are somatic. Only a small fraction of the somatic mutations, the so-called driver mutations, contribute to cancer development by activating or inactivating specific cancer genes. The remainder are passenger mutations that do not confer a growth advantage but were acquired at some point during cancer cell proliferation (1). Differentiating between driver and passenger mutations is one of the biggest challenges in the quest for new cancer genes and putative therapeutic targets. While somatic alterations can be as small as a single nucleotide substitution, insertion or deletion, somatic copy-number alterations (SCNAs) affect the largest fraction of the genome (2). In some cases, SCNAs affect entire or partial chromosome arms. The ability to detect these genomic alterations using (molecular) cytogenetic methods has historically made large SCNAs the best-studied cancer-associated genetic alterations. Many well-known oncogenes and tumor suppressor genes were initially identified as targets of recurrent genomic amplifications or deletions, respectively. Notable examples are the tumor suppressor genes PTEN (3) and RB1 (4) and the oncogenes HER2 (ERBB2) (5) and the MYC family of transcription factors (6,7). The resulting diagnostic and therapeutic successes have made cancer SCNAs the subject of many studies. Additionally, the advent of array comparative genome hybridization (array CGH) platforms that enable robust identification of small SCNAs has greatly improved our knowledge of the cancer genome (8-10).
As cancer genetics has until now mainly focused on protein-coding genes, not much is known about SCNAs affecting non-coding RNA genes in cancer. In recent years, our knowledge of the non-coding genome has expanded enormously. This is especially the case for the class of long non-coding RNAs (lncRNAs), consisting of genes with transcripts longer than 200 nucleotides that do not encode proteins. In the past 5 years, tens of thousands of human lncRNAs have been reported and catalogued, making this the largest gene class in the human genome (11). While the bulk of lncRNAs remain to be functionally annotated, they have been implicated in many important normal cellular processes such as dosage compensation (12), chromatin remodeling (13), and cell differentiation (14); when deregulated, they play a role in disease as well, including cancer (15).
The discovery of cancer-associated lncRNAs such as HOTAIR (16), MALAT1 (17) and PVT1 (18) uncovered an important role for lncRNAs in oncogenesis. The current gap in our knowledge of lncRNA SCNAs stems from the fact that the majority of lncRNA annotations are very recent. Most commercially available platforms are based on older genomic annotations (with no probes for lncRNAs, or probes for as yet unannotated lncRNAs), or lncRNAs are simply overlooked in the data analysis. Indeed, recurrent SCNAs outside of protein coding regions have been reported (2,19). To overcome this problem, existing DNA microarray platforms have been repurposed and their probe content reannotated with current lncRNA annotation (20,21). One such effort resulted in the discovery of the oncogenic FAL1 (focally amplified lncRNA on chromosome 1) lncRNA in ovarian cancer (21). While the potential of this approach lies in its ability to make use of the large amount of publicly available DNA microarray data, the platforms used have several disadvantages for the discovery of putative cancer-associated lncRNAs. Whole cancer genome sequencing could in principle circumvent these limitations, but the method is still relatively expensive and challenging in terms of data analysis. Consequently, public databases (e.g. TCGA) are mainly populated with targeted exome sequencing datasets, again focusing on protein coding genes.
Here we present a targeted and cost-effective approach to identify focal lncRNA SCNAs based on a custom DNA microarray covering 20 418 lncRNA transcripts and their flanking protein coding genes. We show the ability of this platform to detect focal aberrations that only affect lncRNA exons and do not encompass their flanking protein coding genes. By analyzing the DNA of 80 cancer cell lines covering 11 cancer subtypes, we reveal that lncRNAs are frequently targeted by focal aberrations in human cancer. In addition, we have generated a dataset of putative oncogenic and tumor suppressor lncRNAs for future functional studies.

A targeted platform to detect focal copy number changes in lncRNA genes
LncRNAs are underrepresented on commercial array CGH platforms, and the mean chromosomal distance between the probes on these arrays makes them unsuitable for detecting small aberrations that only involve (part of) a single lncRNA gene (Figure S1, Supplementary material).
In order to detect small and focal SCNAs that only affect lncRNA exons, we designed a custom 180k CGH array covering intergenic lncRNA exons and the nearest exons of their flanking protein coding genes. To this end, we constructed a database with 52 324 non-redundant exons derived from all transcripts listed in LNCipedia (Figure 1, Figure S2 and Figure S3). The database was subsequently extended with protein coding gene annotation from Ensembl. Next, we designed probes using the genomic sequence of the lncRNA exons and the two nearest exons of the flanking protein coding genes. By removing duplicate probes in overlapping exons and selecting additional probes for transcripts with fewer exons, we were able to cover the majority (94%) of the transcripts with at least 10 probes (Figure S4). Only 1.2% of lncRNAs could not be covered by any probe. For 95% of the lncRNA transcripts we succeeded in designing 2 probes for each flanking protein coding exon.
To assess the quality of our custom array CGH platform, we compared the profiles for

Frequent focal lncRNA copy number alterations in cancer cell lines
To explore focal lncRNA SCNAs in cancer, we analyzed DNA from 80 cancer cell lines covering 11 cancer subtypes with our custom DNA microarray (Table 1). Extensive filtering was performed on the resulting segments to shortlist focal lncRNA SCNAs. To be considered a lncRNA SCNA, a segment should (1) overlap with exonic lncRNA sequence, (2) not be contained within known segmental duplications,

RT-qPCR confirms the majority of focal aberrations
We devised a unique strategy to validate the selected focal lncRNA SCNAs using qPCR. Assays were designed targeting the genomic locus of the aberration and the nearest exons of the flanking protein coding genes. By comparing the Cq value of the lncRNA locus and the flanking coding exons, we can accurately assess the difference in copy number between the two. Using this strategy, we evaluated 88 events (Figure 3). For 66 of these (75%) an altered copy number status compared to at least one of the two flanking assays could be confirmed, of which 43 (49%) showed the expected relative difference in Cq values with both flanking assays and were thus validated as focal aberrations. The validation rate is higher for the amplifications than for the deletions (56% and 48%, respectively). The validation rate drastically increases when we limit our analysis to the subset of segments with an absolute average log-ratio larger than 2.5. In that case, 58 out of 64 (91%) events are confirmed copy number alterations. The fraction of confirmed focal aberrations remains similar (53%).
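The Cq comparison described above can be illustrated with a minimal sketch. It assumes an amplification efficiency of about 100% (one cycle corresponds to a twofold difference in template). The function names, the 0.5-cycle threshold and the example Cq values are hypothetical and are not taken from the study.

```python
def relative_copy_ratio(cq_lncrna, cq_flank, efficiency=2.0):
    """Fold difference in template of the lncRNA locus versus a flanking assay.

    A lower Cq means more template, so the ratio is
    efficiency ** (cq_flank - cq_lncrna). efficiency=2.0 is an assumption.
    """
    return efficiency ** (cq_flank - cq_lncrna)


def classify_event(cq_lncrna, cq_left, cq_right, expected, min_delta_cq=0.5):
    """Classify one event from the lncRNA assay and its two flanking assays.

    expected is 'amplification' or 'deletion' (the array CGH call).
    Returns 'focal' if both flanks differ in the expected direction,
    'altered' if only one does, and 'unconfirmed' otherwise.
    min_delta_cq is an illustrative threshold, not the authors' value.
    """
    if expected not in ("amplification", "deletion"):
        raise ValueError("expected must be 'amplification' or 'deletion'")
    sign = 1 if expected == "amplification" else -1
    supported = [
        sign * (cq_flank - cq_lncrna) >= min_delta_cq
        for cq_flank in (cq_left, cq_right)
    ]
    if all(supported):
        return "focal"
    if any(supported):
        return "altered"
    return "unconfirmed"


# Hypothetical example: the lncRNA locus amplifies about two cycles earlier
# than both flanking assays, i.e. roughly fourfold more template.
print(relative_copy_ratio(22.0, 24.0))
print(classify_event(22.0, 24.1, 23.9, "amplification"))
```

In this reading, 'altered' corresponds to events confirmed against at least one flanking assay, and 'focal' to events confirmed against both.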

Most novel lncRNA aberrations do not correspond to common somatic variants
As our custom platform differs considerably from other array CGH platforms, it is not unlikely that the newly found SCNAs actually comprise uncharted germline copy-number variants that may exist in a normal population and do not contribute to cancer.
To assess this possibility, we performed an RT-qPCR experiment for five validated loci on DNA from 192 healthy individuals. Neither homozygous deletions nor high-order amplifications could be detected for any lncRNA in any of the samples (Figure S8). Of note, for one lncRNA heterozygous deletions were found in 12 individuals (6%).

Discussion
Even though the number of samples we examined is limited and confined to cell lines, we were able to detect a large number of SCNAs that specifically affect lncRNA exons.
This suggests that, similarly to protein-coding genes, lncRNAs are frequently targeted by SCNAs in cancer. After rigorous filtering focused on novel, highly aberrant segments that do not encompass protein coding genes, we report 136 such events, including 25 that are recurrent. Of those, 76 events were marked as focal based on the copy number of the flanking protein coding genes. Since the cancer genome harbors many large SCNAs, it is important to also consider the events where the flanking protein coding genes are not strictly copy number normal, as long as the lncRNA itself is focally affected by a second event as well.
Our strategy uncovered several cancer-associated lncRNAs. For instance, the known oncogene lnc-MYC-2 (PVT1) was detected as a recurrent focal aberration (Figure 2, Figure S3). PVT1 has been implicated in several cancer types including gastric cancer (22), ovarian cancer and breast cancer (18). PVT1 was found to be co-amplified in more than 98% of cancers with a MYC copy number increase (23). Our work not only confirms frequent amplification of PVT1 in cancer, but also reveals that PVT1 amplifications can be focal. Another interesting agreement with previous work concerns a large-scale pan-cancer study on SCNAs (19). Although the authors mainly focus on SCNAs affecting protein coding genes and use limited lncRNA annotation, they report one lncRNA, lnc-DCTD-5 (LINC00290), as the sole member of a frequently deleted region. Our results reveal a recurrent and focal deletion of this lncRNA in ovarian and breast cancer cell lines, suggesting a role in cancer (Figure 2).
The validation rate determined with qPCR was strongly dependent on the log-ratio cutoff applied to the segments, with segments having an absolute average log-ratio larger than 2.5 showing high validation rates for lncRNA copy number status. The relatively high cutoff is likely to be related to the unique design of our platform. As the probes are confined to small genomic loci (lncRNA exons), it is not unimaginable that the observed signal-to-noise ratio is different compared to typical designs. In addition, qPCR may not be the most appropriate method to detect hemizygous copy number changes. Even with a stringent log-ratio cutoff (2.5), only 50% of the events could be confirmed to be truly focal. This suggests that the limited number of probes on the flanking protein coding genes is insufficient to define the breakpoints of the segments in some cases.
Nevertheless, even when taking the validation rate into account, our research finds about 100 lncRNAs affected by focal SCNAs. As the majority of these events are likely not germline copy-number variants, these SCNAs harbor interesting candidates for further research.

Conclusion
We developed and applied a unique array CGH platform capable of detecting small and focal lncRNA SCNAs. We have screened a panel of 80 cancer cell lines and shortlisted 136 lncRNA genes with a putative role in cancer. This list includes several lncRNAs that have previously been implicated in cancer, validating our approach. Since the great majority of the lncRNAs on our platform have yet to be functionally studied, this finding suggests that our research provides many new cancer-related lncRNA genes. We present this set of lncRNA genes to the lncRNA and cancer research community as novel candidate cancer lncRNA genes for further functional exploration.

Array CGH platform design
Array CGH probe design was performed using Agilent Technologies eArray software.
A BED file of all non-redundant exons was generated from the exon database and uploaded into eArray for probe design. Since our criterion of 2 probes per exon was initially not met, the exon boundaries were extended and the corresponding BED files were uploaded as well. Exon boundaries were extended by 100 bp, 300 bp and 500 bp. In addition, less stringent selection parameters were used for the 500 bp extended exons. In this way, 5 probe datasets were generated and stored in a separate

Array CGH
400 ng of genomic DNA was labeled with Cy3-dCTP (GE Healthcare, Belgium) using a Bioprime array CGH genomic labeling system (Invitrogen, Belgium). In parallel, Kreatech gender-matched controls were labeled with Cy5-dCTP. Samples were hybridized on the custom array CGH arrays for 40 h at 65 °C. After washing, the samples were scanned at 5 µm resolution using a DNA microarray scanner G2505B (Agilent Technologies). The scan images were analyzed using the feature extraction software 9.5.3.1 (Agilent Technologies). Segmentation was performed using the circular binary segmentation algorithm in the DNAcopy R package. Visual inspection and creation of the copy number profile plots were performed with 'Vivar' (26). All raw array CGH data files are publicly available through the Gene Expression Omnibus (GEO) website under accession number GSE85444.
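To give an intuition for what segmentation does with probe log-ratios, the sketch below implements plain recursive binary segmentation. It is not the circular binary segmentation of DNAcopy, which additionally tests two-breakpoint (circular) arcs and assesses change-points by permutation. The function names, the fixed `min_gain` threshold and the toy log-ratios are assumptions made for illustration only.

```python
from statistics import mean


def best_split(values):
    """Return (index, gain) for the single breakpoint that most reduces the
    within-segment sum of squares; index is None if no split helps."""
    n = len(values)
    total = sum(values)
    best_index, best_gain = None, 0.0
    left_sum = 0.0
    for i in range(1, n):
        left_sum += values[i - 1]
        left_mean = left_sum / i
        right_mean = (total - left_sum) / (n - i)
        # Between-segment sum of squares for a split into two groups.
        gain = (i * (n - i) / n) * (left_mean - right_mean) ** 2
        if gain > best_gain:
            best_index, best_gain = i, gain
    return best_index, best_gain


def segment(values, offset=0, min_gain=1.0):
    """Recursively split probe log-ratios (ordered by genomic position).

    Returns (start, end, mean) tuples with probe indices, end exclusive.
    min_gain is an illustrative stopping threshold, not a DNAcopy parameter.
    """
    index, gain = best_split(values)
    if index is None or gain < min_gain:
        return [(offset, offset + len(values), mean(values))]
    return (segment(values[:index], offset, min_gain)
            + segment(values[index:], offset + index, min_gain))


# Toy profile: 5 probes with an elevated log-ratio inside a neutral region.
log_ratios = [0.0] * 10 + [2.0] * 5 + [0.0] * 10
for start, end, avg in segment(log_ratios):
    print(start, end, avg)
```

Each returned segment carries an average log-ratio, which is the quantity the subsequent filtering step operates on.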

Segment analysis and filtering
Segment position and statistics are stored in a MongoDB collection. A Perl script is used to combine the segment annotation with lncRNA and protein coding gene annotation in other collections and to implement the filtering process. First, only segments that overlap lncRNA exons are retained. Next, segments with an absolute average log-ratio less than 1.5 are discarded, as are segments contained within segmental duplications (UCSC genomicSuperDups track) or segments that overlap with more than 3 known variants (Database of Genomic Variants (27)). The absolute log-ratio of the nearest segments covering the flanking protein coding genes should be 0.5 lower than that of the segment covering the lncRNA (corresponding to about 1 copy less). A more stringent subset of segments is obtained by requiring the absolute log-ratio of the nearest segments covering the flanking protein coding genes to be less than 0.35 (copy number neutral).
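The filtering steps above can be summarized in a minimal sketch. The `Segment` fields, the function names and the output labels are hypothetical and do not reproduce the authors' Perl script or MongoDB schema. Two assumptions are made: the flank criterion is read as a difference of at least 0.5 between absolute log-ratios, and `approx_copies` assumes base-2 log-ratios against a diploid reference.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Segment:
    log_ratio: float                 # average log-ratio of the lncRNA segment
    overlaps_lncrna_exon: bool
    in_segmental_duplication: bool   # contained in a genomicSuperDups region
    n_known_variants: int            # overlapping known variants (DGV)
    left_flank_log_ratio: float      # nearest segment on left flanking gene
    right_flank_log_ratio: float     # nearest segment on right flanking gene


def approx_copies(log_ratio, ploidy=2):
    """Approximate copy number, assuming log2 ratios and a diploid reference."""
    return ploidy * 2 ** log_ratio


def classify_segment(seg, min_abs=1.5, max_variants=3,
                     min_flank_diff=0.5, neutral_abs=0.35) -> Optional[str]:
    """Return None for discarded segments, otherwise a label.

    'candidate'       : passes the basic filters only
    'focal'           : both flanks at least min_flank_diff lower (absolute)
    'focal_stringent' : focal, and both flanks below neutral_abs
    """
    if not seg.overlaps_lncrna_exon:
        return None
    if abs(seg.log_ratio) < min_abs:
        return None
    if seg.in_segmental_duplication or seg.n_known_variants > max_variants:
        return None
    flanks = [abs(seg.left_flank_log_ratio), abs(seg.right_flank_log_ratio)]
    if not all(abs(seg.log_ratio) - f >= min_flank_diff for f in flanks):
        return "candidate"
    if all(f < neutral_abs for f in flanks):
        return "focal_stringent"
    return "focal"


# Hypothetical amplified segment with copy number neutral flanks.
example = Segment(2.8, True, False, 0, 0.1, -0.2)
print(classify_segment(example))
print(round(approx_copies(0.5), 2))
```

Under these assumptions, a log-ratio difference of 0.5 corresponds to about 2.8 versus 2 copies, in line with the "about 1 copy" reading in the text.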

Conflict of interest
The authors declare no conflict of interest.