RNA Structure Elements Conserved between Mouse and 59 Other Vertebrates

In this work, we present a computational screen for functional RNA structures, resulting in over 100,000 conserved RNA structure elements found in alignments of mouse (mm10) against 59 other vertebrates. We explicitly included masked repeat regions to explore the potential of transposable elements and low-complexity regions to give rise to regulatory RNA elements. Our analysis pipeline implements a four-step procedure: (i) we screened genome-wide alignments for potential structure elements using RNAz-2, (ii) realigned and refined candidate loci with LocARNA-P, (iii) scored candidates again with RNAz-2 in structure alignment mode, and (iv) searched for additional homologous loci in the mouse genome that were not covered by genome alignments. The 3’-untranslated regions (3’-UTRs) of protein-coding genes and small noncoding RNAs are enriched for structures, while coding sequences are depleted. Repeat-associated loci make up about 95% of the homologous loci identified and are, as expected, predominantly found in intronic and intergenic regions. Nevertheless, we report structure elements that are enriched in specific genomic elements, such as 3’-UTRs and long noncoding RNAs (lncRNAs). We provide full access to our results via a custom UCSC genome browser trackhub freely available on our website (http://rna.tbi.univie.ac.at/trackhubs/#RNAz).


The input alignment
[Figure S1: plot not reproduced; species labels recovered from the figure: mouse, rat, human, horse, elephant, manatee, armadillo, zebrafish]
Figure S1. Input Alignment Characteristics. For each species, the number of nucleotides from mouse that align with this species is plotted as a fraction of the mouse genome length.
As shown in Figure S1, well-studied species with high-quality genome assemblies are apparently better aligned with mouse in the 60-way multiz input alignment than species with lower genome quality. In particular, we notice that a larger fraction of the mouse genome was aligned with human than with rodents (except for rat). Furthermore, there are even alignment blocks that contain only mouse and zebrafish.
This additionally impairs the interpretation of our results in terms of conservation depth.
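The quantity plotted in Figure S1 can be illustrated with a minimal Python sketch. It assumes the alignment has already been reduced to a list of blocks, each given as the number of mouse nucleotides in the block and the species present in it. The function name `aligned_fraction`, the toy blocks, and the genome length are hypothetical. Counting whole blocks ignores gaps within individual species rows, so this is only an approximation of a per-nucleotide count.

```python
from collections import defaultdict

def aligned_fraction(blocks, reference_length):
    """Return, per species, the fraction of the reference genome that lies
    in alignment blocks containing that species.

    blocks: iterable of (reference_nucleotides, species_names) tuples.
    Simplification (assumption): every reference nucleotide in a block is
    counted as aligned to each species present in that block.
    """
    covered = defaultdict(int)
    for ref_nt, species in blocks:
        for name in set(species):  # guard against duplicated names
            covered[name] += ref_nt
    return {name: n / reference_length for name, n in covered.items()}

# Toy input: three blocks on a hypothetical 200-nt reference.
blocks = [
    (100, ["mouse", "rat", "human"]),
    (50, ["mouse", "zebrafish"]),   # a block with only mouse and zebrafish
    (50, ["mouse", "human"]),
]
fractions = aligned_fraction(blocks, reference_length=200)
for name, frac in sorted(fractions.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {frac:.2f}")
```

In this toy input, human covers a larger fraction of the reference than rat, and one block contains only mouse and zebrafish, mirroring the two observations made above.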

The False Discovery Rate (FDR) as a function of the input alignment
Although the RNAz class probability was calibrated to work as consistently as possible for different input alignments, our complex pipeline featuring a realignment step caused the final FDR to be highly dependent on different features of the input alignment. In a recent comparable screen by Seemann et al. [1], the cutoff of their CMfinder score was calibrated in a GC-dependent manner based on the FDR estimation. Since the RNAz class probability is reasonably well calibrated for GC content and overfitting has to be avoided, we chose the RNA class probability cutoff based only on the number of species in the alignment.
For the raw RNAz loci, the dependency of the FDR on the number of species (Figure S2) is very strong. The lowest FDR is observed for alignments with 3 to 10 species, where the FDR lies between 20% and 30%. It is known that RNAz, due to its dependency on RNAalifold, has lower specificity if only 2 species are part of the input alignment, an effect which could not be compensated for during the SVM calibration. For more than 10 species, we start to sample subsets of 10 species, which are then classified by RNAz. Since a raw locus requires only one of 6 samples to be classified as RNA, we get a higher FDR if we sample. Furthermore, as the number of species in the input alignment increases, we can create more diverse samples, thus further increasing the chance of picking up random noise as signal.
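The effect of accepting a locus when any one of 6 samples is classified as RNA can be illustrated with a small Python calculation. It assumes, purely for illustration, that each sample of a non-functional locus is misclassified independently with the same probability `p`. Samples drawn from the same alignment share species and are therefore correlated, so the numbers only indicate the direction of the effect. The function name and the value of `p` are hypothetical.

```python
def prob_any_positive(p, n_samples):
    """Probability that at least one of n_samples is a false positive,
    assuming independent samples with per-sample false positive rate p."""
    return 1.0 - (1.0 - p) ** n_samples

p = 0.05  # hypothetical per-sample false positive rate
single = prob_any_positive(p, 1)
sampled = prob_any_positive(p, 6)
print(f"one classification: {single:.3f}")
print(f"any of 6 samples:   {sampled:.3f}")
```

Under these assumptions, a per-sample rate of 0.05 grows to roughly 0.26 per locus when any of 6 samples suffices, which is consistent with the higher FDR observed once sampling starts.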
We then looked at the FDR for alignments with only two species as a function of the RNA class probability cutoff calculated by RNAz (Figure S4). To achieve an FDR comparable to that of alignments with 3 to 10 species and a score cutoff of 0.5, we have to use a score cutoff of 0.99 for alignments with exactly two species.
For more than 10 species in the alignment, we observe that the number of samples that have to be classified as RNA by RNAz has an effect on the FDR (see Figure S3) that outweighs the effect of the score cutoff. If we count every locus as a hit where at least one or two samples are classified as RNA, we get a high FDR. By requiring at least 5 samples to be RNAz positive, we achieve a better FDR while retaining enough hits.
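The species-dependent filtering described in this section can be summarized as a small decision rule. The following Python sketch is not the pipeline code. The function name `passes_filter` and its parameter names are hypothetical. It also assumes that a sample counts as RNAz positive at a class probability of at least 0.5 and that the cutoffs are inclusive, neither of which is stated explicitly for the sampled case.

```python
def passes_filter(n_species, class_probs,
                  cutoff_two_species=0.99,
                  cutoff_default=0.5,
                  min_positive_samples=5):
    """Decide whether a locus is kept, given the number of species in the
    alignment and the RNA class probabilities reported by RNAz.

    class_probs holds one value for alignments with up to 10 species and
    one value per sample (6 in the text) for larger alignments.
    """
    if n_species < 2:
        raise ValueError("an alignment needs at least two species")
    if n_species == 2:
        # Pairwise alignments: a stricter cutoff compensates for the
        # lower specificity.
        return class_probs[0] >= cutoff_two_species
    if n_species <= 10:
        return class_probs[0] >= cutoff_default
    # More than 10 species: count positive samples (assumed cutoff 0.5).
    positives = sum(p >= cutoff_default for p in class_probs)
    return positives >= min_positive_samples

print(passes_filter(2, [0.95]))                           # False
print(passes_filter(5, [0.60]))                           # True
print(passes_filter(30, [0.9, 0.8, 0.7, 0.6, 0.2, 0.1]))  # False
print(passes_filter(30, [0.9, 0.8, 0.7, 0.6, 0.55, 0.1])) # True
```

Keeping the thresholds as keyword arguments makes it easy to reproduce the trade-off discussed above, for example by lowering `min_positive_samples` to 1 or 2 to see how many additional loci would be counted as hits.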