CirComPara: A Multi-Method Comparative Bioinformatics Pipeline to Detect and Study circRNAs from RNA-seq Data

Circular RNAs (circRNAs) are generated by back-splicing of immature RNA forming covalently closed loops of intron/exon RNA molecules. Pervasiveness, evolutionary conservation, massive and regulated expression, and post-transcriptional regulatory roles of circRNAs in eukaryotes have been appreciated and described only recently. Moreover, being easily detectable disease markers, circRNAs undoubtedly represent a molecular class with high bearing on molecular pathobiology. CircRNAs can be detected from RNA-seq data using appropriate computational methods to identify the sequence reads spanning back-splice junctions that do not co-linearly map to the reference genome. To this end, several programs were developed and critical assessment of various strategies and tools suggested the combination of at least two methods as good practice to guarantee robust circRNA detection. Here, we present CirComPara (http://github.com/egaffo/CirComPara), an automated bioinformatics pipeline, to detect, quantify and annotate circRNAs from RNA-seq data using in parallel four different methods for back-splice identification. CirComPara also provides quantification of linear RNAs and gene expression, ultimately comparing and correlating circRNA and gene/transcript expression levels. We applied our method to RNA-seq data of monocyte and macrophage samples in relation to haploinsufficiency of the RNA-binding splicing factor Quaking (QKI). The biological relevance of the results, in terms of number, types and variations of circRNAs expressed, illustrates CirComPara potential to enlarge the knowledge of the transcriptome, adding details on the circRNAome, and facilitating further computational and experimental studies.


Introduction
Circular RNAs (circRNAs) are generated from immature RNA by a process called back-splicing where the 3 and 5 ends of linear RNA molecules are covalently joined in a non-collinear way forming RNA-loops. It has been estimated that circRNAs are produced from more than 10% of genes [1]. Circularity confers specific properties to circRNAs: they have longer half-lives compared with linear RNAs [2], tend to accumulate in cells with a low proliferation rate [3], and are resistant to RNase R. RNA circularization can engage single exons, two or more exons [4], both exon and intron sequences, or intronic sequences only [5]. A given gene can generate several circular isoforms and these isoforms may show distinct expression profiles [6]. Generation of circRNAs happens at the expense of their corresponding linear RNA isoforms and a correlation between exon skipping and circularization [7] has been demonstrated. Back-splicing adds complexity to alternative RNA splicing, and circRNA biogenesis and splicing are clearly interleaved processes. Thus, mutations or

CirComPara Provides circRNA Detection, Quantification, Annotation and Correlation with Gene Expression
We designed and implemented CirComPara (http://github.com/egaffo/CirComPara), a bioinformatics pipeline that allows detecting, quantifying and studying circRNAs from RNA-seq data. CirComPara quantifies linear transcript expression and, in the subsequent phase, it uses in parallel different methods to identify expression of putative circRNAs. The circRNAs are selected according to a combination of the results of circRNA discovery methods, and quantified with normalized estimates. Ultimately, CirComPara annotates the circRNAs in terms of overlapping genes and provides expression correlation measures of circRNA to overlapping genes or linear counterparts. CirComPara yields results in tabular format to ease custom downstream analysis, as well as different default analysis results in HTML pages with several informative display items, such as statistics on circRNAs types, features and expression, and descriptive analyses of samples in terms of circRNA and gene expression profiles, and correlations.
In the next paragraphs, we first describe the software characteristics and then, by a demonstrative application on real data, we show CirComPara's output.

CirComPara's Default Workflow and Usage
A schematic overview of the CirComPara workflow architecture for the analysis of a single sample is represented in Figure 1A. The required input for the pipeline are RNA-seq reads from Illumina sequencing (both single-and paired-end reads supported), the reference genome in multi FASTA format, and the gene/transcript annotation in GTF format (as retrieved from genomic databases like Ensembl). In a typical analysis, the researcher needs to compile a metadata table specifying the sample files and their respective raw read file locations. In addition, the file location of the reference genome and annotation files, which can be stored in a project specification file (vars.py file), together with other non-default parameters chosen.
For each sample, CirComPara first pre-processes the raw reads to retain only high quality fragments that will be used in downstream analysis. In parallel, CirComPara builds the files required by each method, such as genome indexes and specific formats of annotation. Next, a preliminary read alignment to the (linear) reference genome is performed with strict criteria, especially important for paired-end reads: unpaired and discordantly aligned reads are considered unmapped and thus separated from linear aligned reads. The alignments are used to detect and quantify linear transcript and gene expression, which are computed for each sample both as raw counts and normalized values (fragments per kilobase per million mapped reads; FPKM). Reads that fail to be aligned to the reference genome are used as candidates for the detection of back-splice junctions and are given as input to each circRNA detection method. Multiple methods for circRNA detection and quantification can be selected: at the time of this writing, CirComPara uses, as a default setting, four different circRNA detection tools that ground on different strategies (CIRCexplorer, CIRI, find_circ, and testrealign) and unfiltered outputs of each detection tool are saved separately to be available for custom analyses. CircRNAs are distinguished and named in terms of the back-splice genomic positions that identify them. By this approach, predictions from the different methods and samples can be compared and enhanced by defining the set of the "reliable circRNAs", which, by default, are the circRNAs expressed with at least two back-splice reads and jointly detected by at least two methods.
CircRNAs' back-splice genomic coordinates are compared against the linear transcripts' annotation to relate the genes and genomic features that generated circular isoforms. The overlap with gene annotation allows characterizing back-splices according to transcript structure: exonic, intronic, and intergenic circRNAs are the main classes that can be distinguished. Finally, expression levels of the reliable circRNAs (normalized according to reads per million mapped, RPM) are correlated to the overlapping genes' normalized expression estimates and reported in tabular format. Round corner boxes represent inputs; currently used tools are represented by gray labels next to the relative pipeline level; dotted lines represent optional functions; (B-D) CirComPara summary plots of circular RNAs (circRNAs) expressed; (B) absolute number of circRNAs detected by each method and (C) commonly detected by two or more methods; (D) number of circRNAs expressed per sample, considering the whole set of detected back-splices and the selected subset of circRNAs detected by at least two methods.

CirComPara's Output
CirComPara saves a rich set of results in separate directories. Results of the main analysis steps and objectives are saved in tabular text files, which are useful for data interpretation and post-processing, and offer the user the possibility to examine the complete output. On top, an HTML document reports a detailed summary, with the most relevant results presented in aggregated forms and displayed as tables and figures.
The HTML output file (Supplementary File 1) comprises four different sections. The first and the last sections give technical details, whilst the two central sections, with several subsections each, provide the core information. The first section summarizes the analysis run by listing number and IDs of RNA-seq samples, and the back-splice detection methods used for the combined analysis. The second section is dedicated to statistics and plots reporting the circRNAs detected by at least one

CirComPara's Output
CirComPara saves a rich set of results in separate directories. Results of the main analysis steps and objectives are saved in tabular text files, which are useful for data interpretation and post-processing, and offer the user the possibility to examine the complete output. On top, an HTML document reports a detailed summary, with the most relevant results presented in aggregated forms and displayed as tables and figures.
The HTML output file (Supplementary File 1) comprises four different sections. The first and the last sections give technical details, whilst the two central sections, with several subsections each, provide the core information. The first section summarizes the analysis run by listing number and IDs of RNA-seq samples, and the back-splice detection methods used for the combined analysis. The second section is dedicated to statistics and plots reporting the circRNAs detected by at least one program, and circRNAs found only by one method or commonly detected by two or more methods. Next, only bona fide more robust circRNAs detected by at least two methods (reliable circRNAs) are considered for further elaborations, including: circRNA category annotation according to back-splice end positions in relation to known exons or introns of overlapping genes; number of circRNAs expressed by genes; and expression analyses, also in comparison with gene expression profiles. The last section reports names and versions of the software and packages used in the CirComPara analysis run.

Additional Options and Features
CirComPara is a flexible tool that allows output stringency modulation through different setting of parameters. CirComPara by default applies all four circRNA-discovery methods implemented, but different combinations of methods can be specified. In fact, the user can select which circRNA tools to run (at the time of this writing, from one to up to four). For instance, the user may choose only a pair of methods, such as CIRI and find_circ, or find_circ and testrealign; only three methods, such as CIRCexplorer, CIRI, and find_circ; or even only a single method. In addition, the minimum number of back-splice reads and methods jointly predicting a circRNA, which define the reliable circRNA set, can be changed by the user to tune CirComPara's output. For instance, by setting at least one read and at least one method, CirComPara maximizes its sensitivity, possibly reducing false negative predictions. Conversely, requiring all methods to jointly detect the circRNAs (the most restrictive setting) will possibly reduce the number of false positive predictions.
Moreover, CirComPara can run with alternative workflows. Specifically, raw read preprocessing can be bypassed; already computed genome indexes can be given as input, thus skipping the automatic genome indexes building; and novel genes and transcript isoforms can be inferred from the data when the transcriptome reconstruction option is enabled.
CirComPara optimizes the computational performance by allowing parallel computation at two levels: the former regards a multithreaded run of the tools allowing parallel computing. The latter regards simultaneous execution of CirComPara's steps that are mutually independent, which maximizes the use of available computational devices when running the program.

CirComPara Predicts Thousands of circRNAs Expressed in Monocytes and Macrophages of a QKI Haploinsufficient Patient and Her Sibling
In the following, we present the pipeline output obtained from a run in default mode on a published dataset, as a demonstrative analysis. We analyzed the RNA-seq data produced by de Bruin et al. [25], which focused on the role of Quaking in linear pre-mRNA splicing in the context of monocytes to macrophage differentiation. As the original study did not interrogate circRNAs, our results illustrate CirComPara discovery power and value.
The dataset comprised four samples of primary monocytes from peripheral blood and experimentally induced differentiated macrophages from a QKI haploinsufficient patient (with 50% QKI mRNA and protein expression due to an unbalanced reciprocal translocation hitting the QKI gene) and her sibling (as control QKI wild-type).
Considering the four samples together, as many as 39,538 non-redundant back-splices were detected by at least one method, with quite different numbers of events detected by different methods from the same sequencing data ( Figure 1B): testrealign detected 34,049 back-splice events (86% of the total detected), CIRCexplorer detected 2924 events (7%), whereas CIRI and find_circ detected similar intermediate numbers of back-splices, respectively 6228 (16%) and 6920 (18%). In the QKI dataset, 5759 (14.6%) back-splices were detected by a least two methods ( Figure 1C) and 85% (33,779) by one method only. Hansen et al. [24] showed that, in RNase R treated libraries, the circRNAs detected by at least two methods are enriched with respect to circRNAs detected by only one method. Although circRNAs detected by only one method could represent true positives, in this sample analysis, we considered circRNAs detected independently by two or more methods, according to CirComPara's default settings, to reduce false positive calls. Thus, we obtained a set of fairly reliable 5759 circRNAs, which are expressed in monocytes and macrophages ( Figure 1D), with apparent quantitative and qualitative differences between QKI haploinsufficient and control samples and between monocytes and macrophages.
CircRNA normalized expression (Figure 2A) values ranged over five degrees of magnitude, reaching over 15,800 RPM in the control macrophages samples for the most abundant circRNA (chr11:33286413-3328751) expressed from HIPK3 (homeodomain interacting protein kinase 3). CircRNA median expression levels show small differences across the samples, which is also observed in median gene expression, even if with a different pattern (Figure 2A,B). Regarding the cumulative expression, 752 circRNAs on average accounted 75% of sample circRNA expression, whilst only 239 genes on average accounted 75% of sample gene expression ( Figure 2C,D). qualitative differences between QKI haploinsufficient and control samples and between monocytes and macrophages. CircRNA normalized expression (Figure 2A) values ranged over five degrees of magnitude, reaching over 15,800 RPM in the control macrophages samples for the most abundant circRNA (chr11:33286413-3328751) expressed from HIPK3 (homeodomain interacting protein kinase 3). CircRNA median expression levels show small differences across the samples, which is also observed in median gene expression, even if with a different pattern (Figure 2A,B). Regarding the cumulative expression, 752 circRNAs on average accounted 75% of sample circRNA expression, whilst only 239 genes on average accounted 75% of sample gene expression ( Figure 2C,D). To evaluate CirComPara's predictions, we compared predictions from our sample analysis to the 111,665 circRNAs reported by Hansen et al. [24], in which the circRNA expression estimate from fibroblast samples with and without RNase R treatment provided indication on circRNA prediction accuracy. CircRNAs were grouped into: enriched by the treatment (E), probably representing true To evaluate CirComPara's predictions, we compared predictions from our sample analysis to the 111,665 circRNAs reported by Hansen et al. [24], in which the circRNA expression estimate from fibroblast samples with and without RNase R treatment provided indication on circRNA prediction accuracy. CircRNAs were grouped into: enriched by the treatment (E), probably representing true circular forms; unvaried expression (U); and depleted after treatment (D), probably representing false predictions. The comparison showed that 80% (4619 out of 5759) of the circRNAs selected by CirComPara were also identified by Hansen et al. [24], with 91.6% (4233) of these being enriched, 4.3% (200) being unvaried, and 4.0% (186) being depleted. Table 1 reports the 30 most expressed circRNAs according to de Bruin et al. [25] data. All of these were also detected by Hansen et al. [24], and classified as enriched (26), or unvaried (only four). Furthermore, 14 circRNAs of Table 1 were reported in circBase [26] as experimentally validated in previous studies, including the circRNA from HIPK3 that was recently functionally characterized as a multiple miRNA sponge [27]. In addition, three of the most expressed circRNAs of Table 1 were reported and characterized in other independent studies [28][29][30].
The 5759 reliably expressed circRNAs resulted associated to 3123 annotated genes, with two circRNAs per gene in average and 42% of circRNAs overlapping genes associated with 2 to up to 22 different circRNAs each ( Figure 2E). Indeed, the PICALM gene (coding also the Phosphatidylinositol Binding Clathrin Assembly Protein) expresses 22 different circular isoforms, five of them being more abundant and expressed in all four samples. The large majority (5620, 97.6%) of reliable circRNAs resulted exonic and only few were "intergenic" or intronic (with back-splice ends falling outside annotated genes or annotated exons, respectively).
Expression correlation of circRNAs with genes presented a slight tendency toward positive values ( Figure 2F): the average and the median values of the 2149 computable pairwise Spearman correlations were 0.07 and 0.40, respectively, and 835 (39%) and 676 (32%) circRNA/gene pairs had a positive correlation over 0.5 or a negative correlation lower than −0.5, whereas 638 (30%) had an (absolute) weaker correlation.
In addition, we evaluated the number of circRNAs expressed in monocytes and macrophages from the QKI +/− and the control QKI +/+ sibling ( Figure 3A). The number of circRNAs expressed is lower (0.63×) in QKI +/− monocytes compared to the control, whereas, in macrophages, the number of circRNAs increases at a lower extent (1.15×) in the same comparison. In accordance with the original study of linear transcripts, and considering that no replicates were available, we calculated the log 2 FC of circRNA expression, considering for each cell type QKI+/− versus control values. The expression variation, represented in Figure 3B as waterfall plots of log 2 FC for all expressed circRNAs, is toward downregulation in QKI haploinsufficient monocytes and slightly toward downregulation in macrophages. The number of circRNAs showing a Log 2 FC over 1.5 or lower than −1.5 ( Figure 3C) is also informative and is in accordance with the above observations. Indeed, considering only those circRNAs with absolute log 2 FC > 1.5, we identified 865 and 1904 circRNAs up and downregulated when comparing QKI haploinsufficient monocytes. The same comparison regarding macrophages detected 1631 and 1290 up and downregulated circRNAs.
Finally, as an example of analysis that can be performed from CirComPara's output data, we report in Figure 3D the coordinates and expression values of three circular isoforms expressed from the QKI gene showing different abundance in normal monocytes and macrophages that appeared to be affected by the QKI haploinsufficiency. All three of these circRNAs resulted as being enriched by RNase R according to the data of Hansen et al. [24] and two of them (6:163455279-163535125; 6:163478780-163535125) were previously detected by Rybak-Wolf et al. [6].

Discussion
Research advancements on the analysis of RNA-seq data is generating new protocols [31] and improvements of software tools, including methods for circRNA detection [22]. CirComPara was designed and implemented with independent modules interfacing each other in hierarchical scripts. This gives CirComPara great flexibility for incorporating new features, such as additional methods; for updating the used software; and for upgrading to improved performing tools implemented in the pipeline steps.

Discussion
Research advancements on the analysis of RNA-seq data is generating new protocols [31] and improvements of software tools, including methods for circRNA detection [22]. CirComPara was designed and implemented with independent modules interfacing each other in hierarchical scripts. This gives CirComPara great flexibility for incorporating new features, such as additional methods; for updating the used software; and for upgrading to improved performing tools implemented in the pipeline steps.
CirComPara implements practices that, in our opinion, are the current best for the analysis of RNA-seq data and for circRNA discovery. The initial raw read pre-processing step for trimming and filtering low quality reads is highly recommended, although optional, since it was demonstrated to improve the alignment rate, decrease the timings of subsequent processing, and eventually improve the quality of results [32].
The strategy of mapping RNA-seq reads to the linear genome before circRNA identification, also used by other methods [14,33], provides the possibility to characterize gene or transcript expression, reduces computational load for circRNA discovery, and importantly allows for filtering out false positive back-splice findings.
The four circRNA detection methods considered ground on different search strategies used for read alignment to the reference genome, including Burrows-Wheeler [34][35][36] and suffix arrays [16,37] based methods. The combined use of different methods is expected to increase the sensitivity of the overall procedure to the price of an increased number of false positive predictions. The huge number of circRNAs reported by testrealign and the difference with the other methods (nearly five times more circRNAs) is probably due to the fact that, differently from the other methods, testrealign does not perform any alignment post-processing specific for circRNA identification. Such behavior has also been reported recently by [22]. However, by combining the methods' results and selecting the circRNAs commonly identified by more tools, our method is expected to reduce the number of loose predictions.
CirComPara includes a branch for de novo transcriptome reconstruction and quantification from RNA-seq data in parallel to circRNA detection. Importantly, this feature extends CirComPara application also to species without gene annotation or with scarcely annotated transcriptomes.
From a technical point of view, CirComPara relieves the user from the burden of many preparatory steps that are required by the different tools in the pipeline, such as the read aligners' genome index building and gene annotation formatting. Moreover, CirComPara provides plentiful access to each method's optional parameters. In addition to the above mentioned alternative analysis workflows, advanced users can set and combine the parameters to optimize performance and adapt the analysis to specific data. Nevertheless, the use of default parameters can bear good quality results, as it was reported in the present demonstrative analysis.
The sample analysis showed the discovery power of the pipeline and the richness and usefulness of obtained results. De Bruin et al. [25] provided evidence that QKI is induced during monocyte differentiation and plays a key role as a dynamic regulator of pre-mRNA splicing and expression profile changes that drive monocyte activation, adhesion and differentiation into macrophages. Our sample analysis identified numerous circRNAs expressed in monocytes and macrophages, with more than one circular isoform expressed by half of genes. Most of the selected set of 5759 circRNAs detected by two or more methods resulted to be exonic. Moreover, comparison with independent data revealed that almost 92% of these circRNAs are detected also in other tissues and enriched by RNase R treatment.
The expression of circRNAs were scattered across five orders of magnitude, with a relatively low number of elements expressed at high level. Although a direct comparison is difficult, it is worth noting that circRNA expression values distribution were less skewed than that of genes: 13% of circRNAs and 0.8% of genes account for three quarters of the total expression. Notably, almost all 30 most expressed circRNAs reported in Table 1 resulted either being enriched in RNase R treated libraries, and/or experimentally validated by independent studies, such as the top ranking circRNA we detected, which derives from the HIPK3 gene. A recent study [27] reported four circular isoforms of HIPK3, and showed that the predominant circHIPK3 is abundantly expressed in many tissues where it sponges nine different miRNAs, including the tumor suppressor miRNA miR-124. The same study confirmed circularity and stability of circHIPK3 and showed that the silencing of circHIPK3, but not HIPK3 mRNA, significantly inhibits human cell growth. The other two highly expressed circRNAs in Table 1, expressed by SPECC1 (17:20204333-20205912) and CDYL (6:4891713-4892379), were detected by Schneider et al. [28] that identified IMP3-associated circRNAs. Moreover, the circRNA ZNF609 (zinc finger protein 609) was recently shown to regulate AKT3 (AKT serine/threonine kinase 3) expression by sponging miR-150-5p. This evidence supports the validity of predictions selected by CirComPara.
This study provided a first glance on circRNA expression variations in monocyte to macrophage differentiation and in relation to QKI haploinsufficiency. Extending the results of de Bruin et al. [25], we showed indeed that also circRNA expression is affected by QKI, with a complex pattern during monocyte to macrophage differentiation, as observed for linear isoforms. QKI haploinsufficiency in monocytes seems to reduce the total number of circRNAs expressed respectively to the control and produce a larger number of downregulated circRNAs than of upregulated circRNAs. This result is in accordance with a previous study that showed that QKI promotes circularization [8] and also with de Bruin et al. [25] results on linear RNAs. Indeed, QKI haploinsufficiency in QKI peripheral blood monocytes was previously associated with significantly lowered expression of targets with QKI response elements (QRE) compared to non-targets (no QRE).
We showed that in macrophages the number and the average expression of circRNAs are moderately increased in relation to QKI haploinsufficiency. Also in this case, this is in accordance with the previous observation that macrophages of the haploinsufficient patient showed higher expression of mRNAs containing QREs relative to her sibling. Although preliminary, our results confirm that QKI regulates not only linear splicing but also circular RNA expression during monocyte to macrophage differentiation with a complex pattern. We could hypothesize that QKI regulates the expression of specific circRNAs directly impacting on observed expression patterns. Moreover, QKI regulation of circRNA expression could also be indirect, since QKI could modulate expression of linear RNA isoforms that act or encode trans-acting factors involved in circularization. Both direct and indirect interactions deserve further investigation.

Automation and Parallelism: Scons
The Scons building tool is the CirComPara's engine devoted to the automatic execution of the various scripts and tools implementing each step of the pipeline. Scons is a utility software conceived mainly for the compilation of source code in software production, yet its functions can handle tasks beyond standard compiling procedures.
Scons computes the tasks' execution dependencies, indeed allowing concurrent execution of independent tasks. For instance, CirComPara can simultaneously run the different methods for circRNA detection, but all of them must wait for the termination of the read alignment to the linear genome. In addition, the linear gene/transcript expression quantification is independent from detection of circRNAs, but gene expression to circRNAs' expression correlation computation depends on both of these steps.

CirComPara's Software Tools
We used existing methods to implement the analysis steps and developed custom scripts in Python, R and Bash when needed. At the time of this writing, only one method for read preprocessing is supported, Trimmomatic [38], while read statistics are implemented through FASTQC. For the linear genome mapping step, we chose HISAT2 [34] for its speed, accuracy of alignment, low computational requirements, and compatibility with downstream analysis tools, such as Cufflinks [39] and htseq-count [40], that carry out gene and transcript expression quantification. We restrict linear genome alignments by setting HISAT2 parameters to ensure linear and concordant alignments (-no-discordant -no-mixed options). Gene and transcript expression levels are computed with the Cufflinks tool suite, which is also used for the transcriptome reconstruction optional feature. CircRNA detection methods are: CIRCexplorer [14], which uses the STAR [37] aligner; CIRI [17], which uses the BWA-MEM [36] aligner; find_circ [10], which uses the Bowtie2 [35] aligner; and testrealign, from the segemehl [16] aligner. CircRNA annotation was computed by overlapping back splice genomic positions to gene annotation using BEDTools [41]. Analysis report was generated with custom R scripts using the data.table, ggplot2, and knitr R packages. Software versions used for the presented analysis are reported together with references in Table 2.
Adapter sequences used for read preprocessing were from the Trimmomatic file "TruSeq3-PE-2.fa". The analysis was performed on 64 cores AMD Opteron Processor 6380 with 512 GB of RAM Linux server running 64 bit Ubuntu Precise (12.04.5 LTS).

Conclusions
RNA-seq data have high discovery potential. The recent possibility of detecting circular RNAs using appropriate methods for back-splice identification extended the set of RNAs that can be studied with RNA-seq assays. CirComPara is a new tool that allows detecting, discovering, and quantifying circRNAs from RNA-seq data using four different methods in parallel for back-splice identification, reporting bona fide predictions from their result comparison. CirComPara also allows for annotating circRNAs in terms of overlapping genes, in order to quantify linear RNAs and gene expression, and to ultimately compare and correlate circRNA with gene and transcript expression levels. Thus, CirComPara is an original method providing substantial improvement for a computationally efficient integrative and comparative study of circular and linear transcriptome from RNA-seq experiments.