Non-Redundant tRNA Reference Sequences for Deep Sequencing Analysis of tRNA Abundance and Epitranscriptomic RNA Modifications

Analysis of RNA by deep-sequencing approaches has found widespread application in modern biology. In addition to measurements of RNA abundance under various physiological conditions, such techniques are now widely used for mapping and quantification of RNA modifications. Transfer RNA (tRNA) molecules are among the frequent targets of such investigation, since they contain multiple modified residues. However, the major challenge in tRNA examination is related to a large number of duplicated and point-mutated genes encoding those RNA molecules. Moreover, the existence of multiple isoacceptors/isodecoders complicates both the analysis and read mapping. Existing databases for tRNA sequencing provide near exhaustive listings of tRNA genes, but the use of such highly redundant reference sequences in RNA-seq analyses leads to a large number of ambiguously mapped sequencing reads. Here we describe a relatively simple computational strategy for semi-automatic collapsing of highly redundant tRNA datasets into a non-redundant collection of reference tRNA sequences. The relevance of the approach was validated by analysis of experimentally obtained tRNA-sequencing datasets for different prokaryotic and eukaryotic model organisms. The data demonstrate that non-redundant tRNA reference sequences allow improving unambiguous mapping of deep sequencing data.


Supplement
Non-redundant tRNA reference sequences for deep sequencing analysis of tRNA abundance and epitranscriptomics modifications Florian PICHOT 1,2 , Virginie MARCHAND 2 , Mark HELM 1 , Yuri MOTORIN 2,3 * Figure S1 Number of genomic tRNA genes detected by tRNAScanSE (source gtRNAdb). Genomes (~250 in total) were randomly chosen to provide representative selection for Archaea (35 genomes), Bacteria (150 genomes) and Eukaryota (65 genomes). Only tRNA genes corresponding to 20 standard amino acids were considered. Panel A represents global distribution (number of tRNA genes in log10 scale), panel B -same data sorted by Kingdom. Panel C shows the number of tRNA genes in non-redundant (blue) and in optimized (Step2) references (red), in function of the total number of tRNA genes in genomic reference. Panel D, E and F show distribution by Kingdom and phyla/groups.

Figure S1
Number of genomic tRNA genes detected by tRNAScanSE (source gtRNAdb). Genomes (~250 in total) were randomly chosen to provide representative selection for Archaea (35 genomes), Bacteria (150 genomes) and Eukaryota (65 genomes). Only tRNA genes corresponding to 20 standard amino acids were considered. Panel A represents global distribution (number of tRNA genes in log 10 scale), panel B -same data sorted by Kingdom. Panel C shows the number of tRNA genes in nonredundant (blue) and in optimized (Step2) references (red), in function of the total number of tRNA genes in genomic reference. Panel D, E and F show distribution by Kingdom and phyla/groups.

Figure S2
Alignment results for Non Duplicated (non redundant) tRNA reference (Step1) and optimized tRNA set (Step2) for D. melanogaster and H. sapiens references, maximal distance used is 8 substitutions. Boxplot on the left shows the proportion of tRNA sequencing reads aligned to Step2 reference ('Aligned Step2') and proportions of uniquely and multiply mapped reads at both steps. Red dashed line indicate 12.5% level. Increase of unique mapping and decrease of multiple mapping is shown by arrows. Barplots at the right represent unique and multiple mapping by tRNA species at Step2, in proportion to total and in absolute number of sequencing reads obtained by tRNA, expressed as proportion to total number of mapped reads. tRNAs showing excessive proportion of ambiguous mapping are shown in red.

Figure S3
Sequences of tRNA species showing excessive ambiguous mapping for D. melanogaster and H. sapiens references. Non-identical nucleotides are in bold case, anticodon is in red and underlined.
Bottom -cloverleaf structures of H. sapiens tRNA Val 3 (AAC) and tRNA Ala 1 (NGC). Substitutions are in bold case. Anticodon is in red. Excessive number of mismatched nucleotides in tRNA Val 3 (AAC) stems may drive its instability (degradation) in vivo. Conserved identity elements for AlaRS aminoacylation (A73 and G3*U70) are boxed n bleu. 3'-CCA sequence not shown.

Figure S4
Barplots representing unique and multiple mapping by tRNA species for final manually curated tRNA references (Step 3), in proportion to total and in absolute number of sequencing reads obtained by tRNA, expressed as proportion to total number of mapped reads.

Conclusion: Very high level of identity between two tRNAs Leu_cluster3=bs_tRNALeu_CAG is removed from the reference
Inosine-34 is known to be present in: >bs_tRNAArg_ACG gcgcccgtagctcaattggatagagcgtttgactGcggatcaaaaggttaggggttcgactcctctcgggcgcgcca