Non-coding RNAs (ncRNAs) are functional regulatory molecules that mediate cellular processes, including chromatin remodeling, transcription, post-transcriptional modifications, and signal transduction [1
]. They can be roughly divided into two categories: housekeeping and regulatory ncRNAs [3
]. Housekeeping ncRNAs, such as ribosomal RNAs (rRNAs), transfer RNAs (tRNAs), small nuclear RNAs (snRNAs), and small nucleolar RNAs (snoRNAs), are responsible for basic cellular functions. Regulatory ncRNAs, such as microRNAs (miRNAs), small interfering RNAs (siRNAs), and long non-coding RNAs (lncRNAs), are capable of moving between cells, performing cell-to-cell signaling in development or physiology, and mediating epigenetic inheritance [4].
In plants, most studies identify only a single class of ncRNAs, such as miRNAs or long non-coding RNAs [4
]. Although several studies have identified miRNAs in plants [9
], the identification has not always been performed in the most reliable way [12
]. The criteria for identifying and annotating miRNAs have been steadily refined over time [12].
Coffee is one of the world’s most popular beverages, and 80% of it is produced by 25 million smallholders. Around 125 million people worldwide depend on coffee for their livelihoods. Coffea canephora
is one of the most important agricultural commodities, corresponding to nearly 40% of the world coffee production [16
] and the first coffee species with a publicly available genome annotation [17
]. The Coffea canephora
genome annotation has identified 25,574 protein-coding genes, and approximately 50% of the genome is composed of transposable elements [18
]. The repertoire of annotated non-coding RNAs in that study [18
] is restricted to microRNAs (92 precursors). To date, information about ncRNAs for Coffea canephora
is not clearly organized.
A total of seven studies have predicted miRNAs in C. canephora
using distinct approaches. Five studies have focused only on prediction [18
]. Two studies have used small RNA sequencing to confirm the transcriptional activity of miRNAs [23
]. There is only one study [22
] that has focused on a comprehensive analysis of the C. canephora
sequenced genome to annotate miRNAs, but this study did not take into account recent changes in the miRNA annotation rules for plants. For example, the criteria concerning expression analysis and the curation of precursors have become stricter, and they affect what is defined as a high-quality prediction of a miRNA locus [12].
In this context, a highly accurate ncRNA annotation for coffee is still lacking. To fill this gap, we provided the most extensive and organized catalog of ncRNA loci in the Coffea canephora genome. We characterized snRNAs, snoRNAs, miRNAs, tRNAs, rRNAs, and lncRNAs, and manually standardized the microRNAs previously identified in the species.
The results of the homology-based strategies show that ncRNA searches require a combination of computational strategies to detect the structural diversity of RNA families. In the case of snoRNA detection, using a tool specific to this kind of structure allowed a six-fold increase in the number of hits.
Several curation steps also improved the identification of miRNAs in the C. canephora
genome. Because of the high number of false-positive miRNAs identified in plants, we followed a strict set of rules to select high-confidence candidates from both our predictions and previous studies. We found that more than half of the miRNAs previously predicted in C. canephora
could be false positives, and even among the precursors that matched the curation criteria, there were redundant sequences. The previously predicted and validated miR-408 [24
] did not match all curation criteria, but it was included in the final dataset due to its experimental validation. Thus, it is important to emphasize that wet-lab approaches will still be necessary to biologically validate miRNA families when “big data analysis” is not able to confirm their existence.
The application of clear and rigorous criteria is an important contribution to defining miRNA families when the data rely mostly on high-throughput approaches, but it will probably underestimate the number of miRNA families in genomes. Among the 72 miRNA precursors that matched the curation criteria, 70 were variants of highly conserved plant miRNA families (detailed in Table S12, Supplementary File S2, and Supplementary File S3
). The remaining two precursors belonged to the miR-157 family, involved in regulating vegetative phase change in A. thaliana
], and the miR-3627 family, known for regulating plant metabolism and disease resistance [51].
At least four miRNA families (miR-160, miR-479, miR-827, and miR-1446), considered highly conserved in plants, were previously identified in C. canephora
, but they were excluded from our analysis. MiR-160 has been identified in two studies [22
], but it was excluded because of the mature miRNA size and lack of expression. MiR-160 negatively regulates the auxin response factors ARF10, -16, and -17 in plants [24
]. MiR-479 has been previously identified [23
], but it was excluded because of the mature miRNA size. MiR-479 regulates plant metabolism and disease resistance, as in V. vinifera
]. MiR-827 has been previously identified [22
], but it was excluded from our analysis because the precursor was longer than 300 nt and because of the unpaired position of the mature miRNA in the precursor. MiR-827 negatively regulates the expression of NITROGEN LIMITATION ADAPTATION (NLA), a ubiquitin E3 ligase gene (AT1G02860) in A. thaliana
]. MiR-1446 has been previously identified [22
], but it was excluded because of the lack of expression. MiR-1446 regulates disease resistance, as in S. lycopersicum.
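As an illustration only, the exclusion reasons discussed above can be expressed as a simple rule-based check. The sketch below is a simplification under stated assumptions, not the curation procedure used in this study: the `Candidate` fields and the function name are hypothetical, the 300 nt precursor limit comes from the text, and the 20 to 24 nt window for mature miRNAs is assumed from commonly cited plant miRNA annotation guidance rather than stated here.

```python
from dataclasses import dataclass
from typing import List

# The 300 nt precursor limit is taken from the text above.
MAX_PRECURSOR_NT = 300
# ASSUMPTION: accepted mature miRNA length window (not stated in the text).
MATURE_NT_MIN, MATURE_NT_MAX = 20, 24


@dataclass
class Candidate:
    """Hypothetical record describing one predicted miRNA precursor."""
    family: str
    precursor_len: int    # hairpin length in nt
    mature_len: int       # mature miRNA length in nt
    expressed: bool       # supported by small RNA sequencing reads
    mature_in_stem: bool  # mature miRNA lies in the paired stem region


def exclusion_reasons(c: Candidate) -> List[str]:
    """Return every criterion the candidate fails; an empty list means keep."""
    reasons = []
    if c.precursor_len > MAX_PRECURSOR_NT:
        reasons.append("precursor longer than 300 nt")
    if not MATURE_NT_MIN <= c.mature_len <= MATURE_NT_MAX:
        reasons.append("mature miRNA size out of range")
    if not c.expressed:
        reasons.append("no expression evidence")
    if not c.mature_in_stem:
        reasons.append("mature miRNA in unpaired position")
    return reasons
```

Returning the full list of failed criteria, rather than a single pass/fail flag, mirrors how the excluded families are reported above, where one family can fail on more than one criterion.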
In summary, we delivered here a consolidated and highly curated annotation of the ncRNA complement of the C. canephora genome, unifying, revising, and expanding previous analyses.
Although many studies have shown that plant ncRNAs play key roles in developmental [47
] and regulatory processes [79
], it is still uncommon to find studies identifying more than one or two ncRNA classes. Nevertheless, our analysis showed that, with several curation steps, it is possible to better assign most of the expected “housekeeping ncRNAs” and to predict regulatory ncRNAs with high confidence. In addition, by merging the miRNAs predicted in this study with results from previous studies, we obtained the first highly curated miRNA dataset for C. canephora.
Applying rigorous criteria based on the most recent plant miRNA annotation recommendation [15
], we concluded that over 70% of C. canephora
predicted precursor miRNAs were possibly false positives. This is the most extensive genomic catalog of curated ncRNAs in the Coffea
genus. The annotation of the Coffea canephora
non-coding RNAs provides initial steps toward a better understanding of the small RNA system in plants. Furthermore, it provides a valuable resource for establishing curated non-coding RNA annotations for other plant genomes.