Automated Recognition of RNA Structure Motifs by Their SHAPE Data Signatures

High-throughput structure profiling (SP) experiments that provide information at nucleotide resolution are revolutionizing our ability to study RNA structures. Of particular interest are RNA elements whose underlying structures are necessary for their biological functions. We previously introduced patteRNA, an algorithm for rapidly mining SP data for patterns characteristic of such motifs. This work provided a proof-of-concept for the detection of motifs and the capability of distinguishing structures displaying pronounced conformational changes. Here, we describe several improvements and automation routines to patteRNA. We then consider more elaborate biological situations starting with the comparison or integration of results from searches for distinct motifs and across datasets. To facilitate such analyses, we characterize patteRNA’s outputs and describe a normalization framework that regularizes results. We then demonstrate that our algorithm successfully discerns between highly similar structural variants of the human immunodeficiency virus type 1 (HIV-1) Rev response element (RRE) and readily identifies its exact location in whole-genome structure profiles of HIV-1. This work highlights the breadth of information that can be gleaned from SP data and broadens the utility of data-driven methods as tools for the detection of novel RNA elements.

• Figure S1: Initialization of four Gaussian components using data percentiles.
• Figure S2: Illustration of sequence constraints.
• Figure S3: Secondary structures of the in vitro RREs.
• Figure S4: Sequences and pairing state paths of the SL III/SL IV region for RRE variants in the Sherpa set.
• Figure S5: patteRNA scores on the Sherpa set of RRE SHAPE profiles when searching full-length RRE paths.
• Figure S6: patteRNA scores for RRE motifs across four whole-genome HIV-1 structure profiles.
• Figure S7: Survival functions of c-scores for the 5SL and 4SL native structure of RRE across human transcriptome-wide PARS and HIV-1 SHAPE datasets.
• Figure S8: patteRNA score ratios (5SL/4SL) for mixtures of the 5SL and 4SL native isomers of the RRE.
• Figure S9: Comparison of trained models using an entire dataset and a reduced training subset.
• Table S1: patteRNA scoring of the SL III/SL IV region (nt 7409-7467) of RRE in genomic SHAPE data against the candidate paths described in the Sherpa set.

Figure S1. Initialization of four Gaussian components using data percentiles. Grey histograms represent the distribution of example data. In this case, the parameter K = 2 (i.e., two components per pairing state) and each Gaussian component is represented by a solid line, with blue indicating the two components used to model paired nucleotides and red the two used to model unpaired ones. Gaussian means are spaced at regular percentile intervals, in this case at 20%, 40%, 60% and 80% of the data distribution density, respectively.
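The percentile-based initialization described for Figure S1 can be sketched as follows. This is a minimal illustration, not patteRNA's actual code: the function name `init_gaussian_means` is hypothetical, and the assignment of the lower half of the means to paired nucleotides (lower reactivities) is an assumption made for the example.

```python
import numpy as np


def init_gaussian_means(data, k=2):
    """Place 2*k Gaussian means at evenly spaced percentiles of the data.

    For k = 2 the percentiles are 20, 40, 60 and 80, as in Figure S1.
    """
    values = np.asarray(data, dtype=float)
    values = values[~np.isnan(values)]  # ignore missing reactivities
    n_components = 2 * k
    # Evenly spaced percentiles strictly between 0 and 100.
    percentiles = [100.0 * i / (n_components + 1)
                   for i in range(1, n_components + 1)]
    means = np.percentile(values, percentiles)
    # Assumption for this sketch: paired nucleotides tend to have lower
    # reactivities, so the k lowest means initialize the paired state.
    return percentiles, {"paired": means[:k], "unpaired": means[k:]}


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    example = rng.gamma(shape=1.5, scale=0.4, size=1000)  # synthetic data
    pct, means = init_gaussian_means(example, k=2)
    print(pct)
    print(means)
```

Spacing the means by percentile rather than by value keeps every component anchored in a region that actually contains data, which matters for skewed reactivity distributions.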
Figure S2. Illustration of sequence constraints. When comparing the target motif to the nucleotide sequence in Site 1, all base pairings follow the canonical rules (G-C, A-U, G-U allowed). This site consequently "passes" sequence constraints. In contrast, the nucleotide sequence in Site 2 gives rise to non-canonical base pairings. Specifically, a G-A pairing is deemed invalid. As such, this site violates sequence constraints.

Figure S3. patteRNA's c-scores for the five paths A-E. Highlighted with a star is the score for the predicted path in the tested profile. (F) c-scores for the two native 5SL and 4SL isomers. Bars correspond to scores for paths A-E on the 5SL (black) and 4SL (grey) profiles. As in the other panels, stars highlight scores for the predicted path in each profile, namely path A for 5SL and path B for 4SL. Note that y-axes start at 1 to better highlight differences in c-scores between paths, which relate primarily to differences in 59 out of 232 nucleotides when searching the full-length path.
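The sequence-constraint check illustrated in Figure S2 can be sketched as below. This is a hedged example rather than patteRNA's implementation: it assumes the target motif is written in dot-bracket notation without pseudoknots, and the names `base_pairs` and `passes_sequence_constraints` are hypothetical.

```python
# Canonical pairings allowed by the rules in Figure S2 (G-C, A-U, G-U).
CANONICAL_PAIRS = {
    ("G", "C"), ("C", "G"),
    ("A", "U"), ("U", "A"),
    ("G", "U"), ("U", "G"),
}


def base_pairs(dot_bracket):
    """Return (i, j) index pairs encoded by a dot-bracket string."""
    stack, pairs = [], []
    for i, symbol in enumerate(dot_bracket):
        if symbol == "(":
            stack.append(i)
        elif symbol == ")":
            if not stack:
                raise ValueError("unbalanced ')' in motif")
            pairs.append((stack.pop(), i))
    if stack:
        raise ValueError("unbalanced '(' in motif")
    return pairs


def passes_sequence_constraints(sequence, dot_bracket):
    """True if every base pair of the motif is canonical at this site."""
    if len(sequence) != len(dot_bracket):
        raise ValueError("sequence and motif must have the same length")
    seq = sequence.upper().replace("T", "U")  # accept DNA-style input
    return all((seq[i], seq[j]) in CANONICAL_PAIRS
               for i, j in base_pairs(dot_bracket))


if __name__ == "__main__":
    motif = "(((...)))"
    print(passes_sequence_constraints("GGGAAACCC", motif))  # all G-C pairs
    print(passes_sequence_constraints("GGGAAACCA", motif))  # contains a G-A pair
```

A site that fails this check can be discarded before scoring, since the motif could not form there under canonical pairing rules.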

Figure S6. patteRNA scores for RRE motifs across four whole-genome HIV-1 structure profiles. c-scores for full-length paths A (5SL structure, left panels) and B (4SL structure, right panels) across all sites in the HIV-1 genome. The dataset and modifying reagent used are indicated in each panel and include the Watts set (SHAPE assayed with 1M7) and three profiles from the Siegfried set (SHAPE-MaP assayed with 1M6, 1M7, and NMIA, respectively). Peaks at nucleotide 7306 correspond to the known start location of the RRE.

Figure S7. Survival functions of c-scores for the 5SL and 4SL native structure of RRE across human transcriptome-wide PARS and HIV-1 SHAPE datasets. We report c-scores for searches conducted across 649 transcripts in the PARS set with data density above 75% (i.e., ≤ 25% missing data), as well as c-scores from the entire HIV-1 RNA genome as probed with 1M7 by Siegfried et al. The y-axis represents the proportion of data points with c-scores above the cutoff reported on the x-axis, i.e., the survival function defined as 1 − CDF(c), where CDF(c) is the cumulative distribution function. The grey rectangle highlights the dynamic range of c-scores (10.6 to 13.2) obtained at the location of the RRE for all considered RRE paths and HIV-1 SHAPE profiles (see Table 1 for details).

Figure S8. patteRNA score ratios (5SL/4SL) for mixtures of the 5SL and 4SL native isomers of the RRE. The y-axis corresponds to c-score ratios between the 5SL and the 4SL paths (c_5SL / c_4SL). Results indicate a stable progression of c-score ratios, initially favoring the 5SL structure until the SHAPE data is composed of 30% 4SL, at which point the 4SL structure receives higher scores.

Figure S9. Comparison of trained models using an entire dataset and a reduced training subset. The input data are based on the HIV-1 genome probed with 1M7 from the Siegfried set and partitioned into 100 bp fragments to mimic multiple transcripts.
Gaussian Mixture Models (black lines) learned by patteRNA as well as Hidden Markov Model parameters for (A) the entire dataset and (B) a training subset determined using KL-divergence. Grey histograms represent the distribution of the SHAPE data. Distributions associated with paired and unpaired nucleotides are shown as solid blue and red lines, respectively. Individual Gaussian components are highlighted by dashed colored lines (two for each pairing state, as the optimal K = 2 for this dataset).
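The survival function defined in the caption of Figure S7, 1 − CDF(c), can be estimated empirically from a list of c-scores. The sketch below is an illustration under stated assumptions, not the code used to produce the figure: the function name `survival_function` is hypothetical, and the CDF is taken as the fraction of scores less than or equal to the cutoff, so the result is the proportion of scores strictly above it.

```python
import numpy as np


def survival_function(scores, cutoffs):
    """Empirical 1 - CDF(c): fraction of scores strictly above each cutoff."""
    values = np.asarray(scores, dtype=float)
    values = np.sort(values[~np.isnan(values)])  # drop missing scores
    if values.size == 0:
        raise ValueError("no valid scores provided")
    cutoffs = np.asarray(cutoffs, dtype=float)
    # Number of scores <= cutoff, i.e. the empirical CDF times n.
    n_below_or_equal = np.searchsorted(values, cutoffs, side="right")
    return 1.0 - n_below_or_equal / values.size


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    synthetic_scores = rng.normal(loc=2.0, scale=1.5, size=10_000)  # synthetic
    grid = np.linspace(0.0, 14.0, 8)
    for c, s in zip(grid, survival_function(synthetic_scores, grid)):
        print(f"cutoff={c:5.1f}  survival={s:.4f}")
```

Reading such a curve at a c-score of interest gives the fraction of sites scoring higher, which is how a score range such as the one highlighted in Figure S7 can be placed in the context of a background distribution.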