PlantMirP-Rice: An Efficient Program for Rice Pre-miRNA Prediction

Rice microRNAs (miRNAs) are important post-transcriptional regulation factors and play vital roles in many biological processes, such as growth, development, and stress resistance. Identification of these molecules is the basis of dissecting their regulatory functions. Various machine learning techniques have been developed to identify precursor miRNAs (pre-miRNAs). However, no tool is implemented specifically for rice pre-miRNAs. This study aims at improving prediction performance of rice pre-miRNAs by constructing novel features with high discriminatory power and developing a training model with species-specific data. PlantMirP-rice, a stand-alone random forest-based miRNA prediction tool, achieves a promising accuracy of 93.48% based on independent (unseen) rice data. Comparisons with other competitive pre-miRNA prediction methods demonstrate that plantMirP-rice performs better than existing tools for rice and other plant pre-miRNA classification.


Introduction
MicroRNAs (miRNAs) are an important type of short (approximately 20-24 nucleotides (nt)) small non-coding RNA (sRNA), and they are involved extensively in post-transcriptional regulation of gene expression in animals, plants, and viruses [1]. In plants, the primary transcript of miRNA gene (pri-miRNA) is mainly transcribed from intergenic regions of the genome by RNA polymerase II, and then pri-miRNA is cleaved into miRNA precursor (pre-miRNA) with characteristic stem-loop (hairpin) structure. Subsequently, pre-miRNA, which is exported to the cytoplasm under the action of HASTY protein, is cleaved by Dicer-like (DCL) enzyme into a miRNA duplex, consisting of a miRNA and miRNA* strand. miRNA duplex is further processed into mature miRNA in the cytoplasm. Finally, mature miRNA is included into RNA-induced silencing complex (RISC), and it then mediates the degradation or transcription inhibition of messenger RNA (mRNA) through the principle of complementary base pairing [2][3][4][5].
It has been confirmed that plant miRNAs are crucial regulators in plant growth, development, and stress resistance [6,7]. Particularly, Oryza sativa L. is an important crop and staple food for Asian countries, thus Oryza sativa miRNAs attract much attention. As a typical example, Wang et al. validated that miR164a, as a general negative regulator, is involved in rice immunity against the blast fungus by targeting OsNAC60. Furthermore, they argued that the miR164a/OsNAC60 module may be considered as a common immune regulator for diverse pathogens [8]. Moreover, yield of rice can be greatly Genes 2020, 11, 662 2 of 11 increased by shaping inflorescence architecture through blocking miR396b with direct induction of the OsGRF6 gene (miR396b-OsGRF6 module) [9]. The effects of miR396c-OsGRF4-OsGIF1 regulatory module on grain size and yield of rice can be confirmed [10]. In addition, Swetha et al. revealed that major domestication-related phenotypes are related to loss of miRNA-mediated laccase silencing in Indian rice [11]. Transgenic microRNA-14 rice has been shown to have high resistance to rice stem borer [12].
Identification of miRNAs is a foundation for dissecting their regulatory functions. Traditionally, identifying miRNAs with experimental methods is inevitably time-consuming, cost-expensive, and even leads to many miRNAs being missed [13]. Next-generation sequencing technology has made it possible to identify miRNAs in genome-scale with high sensitivity. Currently, miRNAs are identified based on deep-sequencing technology followed by bioinformatics analyses and/or wet methods, such as Northern blot analyses and qPCR assay. Identification of miRNAs from deep-sequencing reads is almost exclusively based on identification of characteristic hairpin structures of pre-miRNA sequences. The pre-miRNA sequences are obtained by firstly mapping short sequences to genome, and then searching those loci that may produce stable stem-loop structures [14][15][16][17][18]. However, in order to improve true positive rate, many mapping loci with low density of read coverage may be directly discarded even if their flanking sequences can perfectly form hairpin structures. Therefore, these miRNA biogenesis-based approaches may miss some low abundance miRNAs [14][15][16][17][18].
Compared with the former methods, machine-learning-based methods do not require genomic information and expression information, and mainly leverage sequence and structure features of pre-miRNAs. Thus, machine-learning-based methods can be used for de novo prediction of miRNAs, i.e., without using a reference genome. In fact, some of these tools have been successfully used to distinguish pre-miRNAs from other RNA sequences [19][20][21][22][23]. Although, like animal pre-miRNAs, plant pre-miRNAs also have the stem-loop structures, secondary structure of plant pre-miRNAs is more complicated than that of animal pre-miRNAs, which makes plant pre-miRNA prediction more difficult. This may be a reason why few of prediction tools are designed specifically for plant pre-miRNAs. In 2010, Xuan et al. constructed a support vector machine-based (SVM-based) classifier (PlantMiRNAPred) with positive pre-miRNAs from eight plant species. The PlantMiRNAPred SVM performs excellently in the classification of real and pseudo plant pre-miRNAs [24]. In addition, other tools have also be developed for plant pre-miRNA detection, such as random forest-based HuntMi [25], decision tree-based miRNAprediction [26], and SVM-based miPlantPreMat [5]. Previously, we also developed plantMirP for prediction of plant pre-miRNAs by incorporating five knowledge-based energy features with 48 sequence and structure features.
Although some efforts have contributed to this area, no tool has been implemented specifically for rice pre-miRNA prediction. We argue that the performance of plant pre-miRNA prediction can be further improved if species-specific information embedded in sequences are artfully extracted and characterized by well-constructed features. To do this, we present a new set of knowledge-based energy features, then merge them with other features carefully selected from published studies to form a feature set. Based on rice pre-miRNAs from the miRBase database, we designed a random forest-based classifier of plantMirP-rice (hereinafter called riceMirP) to specifically predict rice pre-miRNAs. The riceMirP exhibits a very promising performance: an accuracy of 93.48%, sensitivity of 87.91%, specificity of 98.15%, and Mathew's correlation coefficient of 0.8710 based on independent (unseen) rice data.

Data Preparation
After removing sequences containing non-AUCG nucleotides, 604 Oryza sativa pre-miRNAs were obtained from the miRBase database (release 22) (http://www.mirbase.org/) [27] and considered as a positive dataset. We randomly selected 422 real pre-miRNAs as positive training samples and used the others as independent positive testing samples. According to the existing method, negative datasets were produced from Oryza sativa protein coding sequences (CDSs) downloaded from the PlantGDB database (http://www.plantgdb.org/). To be specific, all CDS sequences of Oryza sativa were joined together to form a non-overlapping long sequence. Non-overlapping segments were extracted by fragmenting this long CDS sequence. The secondary structures of extracted segments and real rice pre-miRNAs were predicted by using RNAfold software (http: //rna.tbi.univie.ac.at/cgi-bin/RNAWebSuite/RNAfold.cgi) with the default parameters. The extracted segments are referred to as pseudo pre-miRNAs if they have hairpin structures similar to that of the pre-miRNAs. Furthermore, in order to ensure high similarity between pseudo and real pre-miRNAs, two criteria for randomly selecting pseudo pre-miRNAs are that: (1) the number of paired bases (including GU wobble pairs) in the secondary structures of pseudo pre-miRNAs should not be lower than the minimum number of paired bases in the secondary structures of real pre-miRNAs and (2) the folding free energy of pseudo pre-miRNAs should not be higher than the maximum of the folding free energy of the real pre-miRNAs. Finally, 502 pseudo pre-miRNAs were randomly selected as negative training samples, and 216 pseudo pre-miRNAs as independent negative testing samples.

Feature Extraction
The sequences (character strings) of real and pseudo pre-miRNAs are submitted into RNAfold to predict secondary structures (structure strings) with the default parameters ( Figure 1). The character strings and structure strings of real and pseudo pre-miRNAs are reversed to obtain the corresponding reversed strings ( Figure 1). As previously mentioned, in the predicted secondary structure, paired or unpaired nucleotides indicated by "(" in 5 end and ")" in 3 end are indistinguishably represented by "(" [20,28,29]. Then, the Needleman-Wunsch algorithm is used to align structure strings and the corresponding reversed structure strings. The symbol of "-" is used to fill the gap between the aligned structure strings (Figure 1). The aligned character strings are obtained according to the positions of "-" in the aligned structure strings ( Figure 1). The aligned character strings are divided into 20 segments and the ratios of character pair w (w ∈ {AA, AU, AG, AC, UU, UG, UC, GG, GC, CC, A-, U-, G-, C-}) are calculated in each segment. The ratios of character pair w in the s-th segment for positive and negative samples are added up, respectively, and denoted by P w (s) and N w (s) Finally, position-dependent potentials U w (s) of character pair w in the s-th segment are calculated according to formula below: Here P w (s) and N w (s) are required to be larger than 0. w runs through all possible character pairs, and s runs through all segments. Finally, a set of energy scores for a given sample is obtained according to the following formulae: These energy scores of S(w) and S(s) are known as the knowledge-based energy score 1 (hereinafter called energy score 1) in order to distinguish them from the pre-existing knowledge-based energy features presented in our previous study [29]. The flowchart of feature extraction for knowledge-based energy score 1 is displayed in Figure 1. Genes 2020, 11, x FOR PEER REVIEW 4 of 12

Performance Evaluation
In a classification problem, four measures of the sensitivity (Se), specificity (Sp), accuracy (Ac), and Mathew's correlation coefficient (MCC), are widely used to evaluate the prediction model, and they are calculated according to the following definitions [30][31][32]: Here TP (true positive) and FP (false positive) are the number of correctly-and incorrectlypredicted samples in positive samples, while TN (true negative) and FN (false negative) are the number of correctly-and incorrectly-predicted samples in negative samples. The receiver operating characteristic (ROC) curve is plotted for visualizing classification performance. Then, the area under the receiver operating characteristic (AUC) curve is used as a comprehensive measure to evaluate the classification algorithm.

Performance Evaluation
In a classification problem, four measures of the sensitivity (Se), specificity (Sp), accuracy (Ac), and Mathew's correlation coefficient (MCC), are widely used to evaluate the prediction model, and they are calculated according to the following definitions [30][31][32]: Here TP (true positive) and FP (false positive) are the number of correctly-and incorrectly-predicted samples in positive samples, while TN (true negative) and FN (false negative) are the number of correctlyand incorrectly-predicted samples in negative samples. The receiver operating characteristic (ROC) curve is plotted for visualizing classification performance. Then, the area under the receiver operating characteristic (AUC) curve is used as a comprehensive measure to evaluate the classification algorithm.

Input, Output, Dependencies, Platforms, and Application Scenarios
RiceMirP, a random forest-based classifier, is implemented in Perl (v5.24.1) and R (v3.2.2), with the recommended versions in parentheses. The Random Forest algorithm is implemented by the randomForest R package (v4. [6][7][8][9][10][11][12][13][14]. RiceMirP is available as a stand-alone package where all necessary scripts and data (including data used in this study) are contained. The local package of riceMirP is freely available to the academic community at https://github.com/yygen89/riceMirP. RiceMirP can run on both Windows and Linux platforms. RiceMirP is designed in an easy-to-use manner and is an installation-free software. Before running riceMirP, the dependencies of Perl, R, and randomForest R package are required to be preinstalled in local machines. RiceMirP requires three FASTA-formatted input files: the first two files containing nucleotide sequences of positive and negative training samples, respectively, and the last one containing nucleotide sequences of testing samples. The output file of riceMirP includes the three-column contents of the identifier, predicted label (positive or negative), and corresponding score for each testing sample. Please make sure that the identifier for each sequence in FASTA-formatted files is unique. Generally, a higher score indicates that the testing sample is more likely to be a predicted label. RiceMirP can be applied to a sequence classification problem-for a given sequence, what is the likelihood of this given sequence being positive. For miRNA prediction from small RNA sequencing data, riceMirP, in conjunction with another kind of miRNA biogenesis-based approach, can further reduce the false-positive rate.

The Algorithm for the Prediction of Rice Pre-miRNAs
As we all know, for machine-learning-based classification and prediction, it is a quite crucial and challenging task to extract appropriate features to train a model. In our previous study, a set of knowledge-based energy features was constructed by tactfully combining a widely-used k-mer scheme in bioinformatics with distance-dependent potential in statistical physics [29]. In addition, knowledge-based energy features have been demonstrated to have very high discriminatory power [29], which suggests that relative position (or distance distribution) information of k-mer pairs is very valuable. In this study, we extend the distance-dependent potential presented previously [29] to position-dependent potential, and then construct a new set of knowledge-based energy features (energy score 1) (see details in Materials and Methods). In addition to the 34 novel features presented here, in order to further improve prediction performance, we also collect 49 sequence and structure features from previously-published studies. Full features used in riceMirP are listed in Table 1.
In order to train the prediction model, 422 known Oryza sativa pre-miRNAs (the positive training dataset) and 502 "pseudo" pre-miRNAs (the negative training dataset) were merged together to obtain a training dataset (see details in Materials and Methods). Based on this training dataset, 4-, 6-, 8-, and 10-fold cross-validations (CVs) were performed to evaluate performance of riceMirP. The ROC curves of 4-, 6-, 8-, and 10-fold CVs overlapped almost completely (Figure 2), which shows that riceMirP is very robust. Furthermore, the AUC values of 4-, 6-, 8-, and 10-fold CVs were 0.9787, 0.9785, 0.9782, and 0.9779, respectively (Figure 2), which indicates that riceMirP is a very promising tool. It is worth noting that if only using knowledge-based energy score 1 (including 34 features), the AUC value of the 10-fold CV was still as high as 0.936 (Figure 2). This result demonstrates that the above-mentioned features of energy score 1 constructed based on position-dependent potential have very high discriminating power. Moreover, an independent (unseen) testing dataset including 182 positive and 216 negative samples was randomly selected to evaluate the prediction performance of riceMirP (see details in Materials and Methods). As shown in Figure 3, riceMirP with full features had a promising Ac of 0.9348, Se of 0.8791, Sp of 0.9816, and MCC of 0.8710.
We compared riceMirP to the state-of-the-art plantMirP, which was recently designed especially for prediction of plant pre-miRNAs. Based on the same training dataset of riceMirP, the 10-fold CV was implemented for riceMirP and plantMirP, respectively. Figure 4 shows that the AUC values of riceMirPp and plantMirP were 0.9779 and 0.9686. Therefore, riceMirP performs better to plantMirP in rice pre-miRNA classification.

Comparison with Other Competitive Methods
In order to test whether riceMirP can be used for other plant pre-miRNA prediction, we compared riceMirP with competitive pre-miRNA prediction methods. In addition, to avoid any possible bias from data used in riceMirP, all comparisons were performed based on the data of other tools or the data from the third party. We firstly compared riceMirP to plantMirP based on the training dataset of plantMirP. Likewise, the performance comparison between two tools was visualized by the ROC curve of 10-fold CV ( Figure 5). It is evident that riceMirP is slightly superior to plantMirP in plant pre-miRNA prediction.
Further, we compared riceMirP with miPlantPreMat, which is an SVM-based classifier for

Comparison with Other Competitive Methods
In order to test whether riceMirP can be used for other plant pre-miRNA prediction, we compared riceMirP with competitive pre-miRNA prediction methods. In addition, to avoid any possible bias from data used in riceMirP, all comparisons were performed based on the data of other tools or the Genes 2020, 11, 662 8 of 11 data from the third party. We firstly compared riceMirP to plantMirP based on the training dataset of plantMirP. Likewise, the performance comparison between two tools was visualized by the ROC curve of 10-fold CV ( Figure 5). It is evident that riceMirP is slightly superior to plantMirP in plant pre-miRNA prediction.
Genes 2020, 11, x FOR PEER REVIEW 9 of 12 classification results reported previously [24] are directly adopted for comparison. For all testing datasets (except the "updated aly"), the accuracies of riceMirP were higher than those of PlantMiRNAPred ( Figure 7). Furthermore, the overall accuracy of riceMirP was much better than those of microPred and triplet-SVM (Figure 7). The above-mentioned results indicate that the schemes of feature extraction and algorithm presented here are universal and are not limited to rice.   Further, we compared riceMirP with miPlantPreMat, which is an SVM-based classifier for identifying plant pre-miRNAs and the corresponding mature miRNAs. Similarly, in order to avoid potential effects from dataset, we firstly used the dataset ("mirPlantPre19_single.txt" & "negData.txt") of miPlantPreMat to train a prediction model for riceMirP. Then, the dataset ("mirPlantPre20_single.txt") of miPlantPreMat was considered as a positive testing dataset, and submitted into miPlantPreMat and riceMirP for prediction, respectively. Because there is no negative testing dataset in the package of miPlantPreMat, the negative samples from third-party (i.e., PlantMiRNAPred) were collected as an independent negative testing dataset, and then directly submitted into riceMirP and miPlantPreMat for prediction. In this way, the Ac, Sp, and Se values of riceMirP and miPlantPreMat were obtained, respectively. Obviously, riceMirP achieved better classification performance than miPlantPreMat in plant pre-miRNA prediction ( Figure 6).
Finally, we compared riceMirP with triplet-SVM, microPred, and PlantMiRNAPred. In particular, PlantMiRNAPred is designed specifically for prediction of plant pre-miRNAs, and achieves >90% accuracy on multiple plant datasets. Likewise, the training dataset ("train_negative_980_seq.txt" & "train_positive_980_seq.txt") of PlantMiRNAPred was used as a training dataset to train model for riceMirP. Then, 11 testing datasets of PlantMiRNAPred were submitted into riceMirP to calculate prediction accuracy. These testing datasets of PlantMiRNAPred included three parts: the known plant pre-miRNAs from eight species, which were used for evaluating the ability of identifying the real pre-miRNAs; the 1142 negative testing samples, which were used for testing the ability of identifying the pseudo hairpins; and the updated dataset, which were used to observe the ability of discovering new plant pre-miRNAs. Because there is no stand-alone version of PlantMiRNAPred, and the web-server of PlantMiRNAPred was not available, classification results reported previously [24] are directly adopted for comparison. For all testing datasets (except the "updated aly"), the accuracies of riceMirP were higher than those of PlantMiRNAPred ( Figure 7). Furthermore, the overall accuracy of riceMirP was much better than those of microPred and triplet-SVM (Figure 7). The above-mentioned results indicate that the schemes of feature extraction and algorithm presented here are universal and are not limited to rice.

Conclusions
In this study, a promising random forest-based classifier, riceMirP, was constructed specifically for predicting rice pre-miRNAs by combining 34 novel knowledge-based energy features with 49 other existing sequence and structure features extracted from published studies. Particularly, riceMirP was superior to the state-of-the-art plantMirP in rice pre-miRNA prediction, which suggests that it is useful to construct a prediction tool specifically for rice pre-miRNAs. In addition, the extensive comparisons with existing pre-miRNA prediction methods, such as plantMirP, miPlantPreMat, PlantMiRNAPred, triplet-SVM, and microPred demonstrated that riceMirP also exhibits higher classification performance in other plant pre-miRNA prediction. Moreover, these

Conclusions
In this study, a promising random forest-based classifier, riceMirP, was constructed specifically for predicting rice pre-miRNAs by combining 34 novel knowledge-based energy features with 49 other existing sequence and structure features extracted from published studies. Particularly, riceMirP was superior to the state-of-the-art plantMirP in rice pre-miRNA prediction, which suggests that it is useful to construct a prediction tool specifically for rice pre-miRNAs. In addition, the extensive comparisons with existing pre-miRNA prediction methods, such as plantMirP, miPlantPreMat, PlantMiRNAPred, triplet-SVM, and microPred demonstrated that riceMirP also exhibits higher classification performance in other plant pre-miRNA prediction. Moreover, these above-mentioned results also illustrate that the novel knowledge-based energy features (i.e., energy score 1) proposed here have very high discriminatory power, and that the scheme of feature extraction presented here is universal and is not limited to rice. Taken together, the results obtained in this study might be beneficial for subsequent researches.

Conflicts of Interest:
The authors declare no conflict of interest.