PINC: A Tool for Non-Coding RNA Identification in Plants Based on an Automated Machine Learning Framework

There is evidence that non-coding RNAs play significant roles in the regulation of nutrient homeostasis, development, and stress responses in plants. Accurate identification of ncRNAs is the first step in determining their function. While a number of machine learning tools have been developed for ncRNA identification, no dedicated tool exists for ncRNA identification in plants. Here, an automated machine learning tool, PINC, is presented that identifies ncRNAs in plants from RNA sequences. First, we extracted 91 features from each sequence. Second, we selected 10 features by combining the F-test with a variance threshold. The AutoGluon framework was then used to train models for robust identification of non-coding RNAs on datasets constructed for four plant species. Finally, these steps were combined into a tool, called PINC, for the identification of plant ncRNAs, which was validated on nine independent test sets; the accuracy of PINC ranged from 92.74% to 96.42%. Compared with CPC2, CPAT, CPPred, and CNIT, PINC outperformed the other tools in at least five of the eight evaluation indicators. PINC is expected to contribute to identifying and annotating novel ncRNAs in plants.


Introduction
RNA is the template that codes for the proteins required for cellular function. RNA is structurally similar to DNA, but its function and chemical composition are fundamentally different. At a high level, RNA is divided into two main groups: coding RNA, which accounts for approximately 2% of all RNAs, and non-coding RNA (ncRNA), which accounts for the majority (>90%) of RNAs [1]. Non-coding RNA refers to all RNAs that are transcribed from DNA but do not code for proteins. Additionally, ncRNA can be categorized into two groups according to sequence size: long non-coding RNAs (lncRNAs), with sequences >200 nucleotides, and small non-coding RNAs (sncRNAs), with sequences shorter than 200 nucleotides [2]. In previous research, ncRNAs were frequently referred to as "useless genes" or transcriptional "noise" [3,4]. In contrast, a growing number of experiments have demonstrated that ncRNAs play important roles in a variety of biological processes, including gene regulation/expression, gene silencing, and RNA modification and processing, as well as multiple important roles in life activities [5][6][7]. Numerous plant-specific biological processes, including the regulation of plant nutrient homeostasis, development, and stress responses, have been linked to ncRNAs [8][9][10]. miRNAs and trans-acting siRNAs, for instance, contribute to leaf senescence in Arabidopsis; miR164 and its target ORE1 control leaf senescence in Arabidopsis, and as miR164 expression declines, ORE1 expression eventually increases [11]. In addition, overexpression of miR398b has been shown to decrease the transcript levels of genes encoding superoxide dismutase (CSD1, CSD2, SODX, and CCSD), resulting in the production of reactive oxygen species (ROS) and increased rice resistance to Magnaporthe oryzae [12,13].

Our experimental results include a number of significant contributions: (1) by combining the F-test and a variance threshold, 10 of the 91 extracted features were identified as strongly discriminating between ncRNAs and coding RNAs in plants; (2) using the AutoML framework, a robust model for non-coding RNA identification was obtained; (3) building on these two points, we developed a tool called PINC for ncRNA identification. After comparing PINC with the CPC2, CPAT, CNIT, and CPPred identification tools on nine independent test sets, we found that PINC performed exceptionally well, suggesting that PINC is a reliable method for ncRNA identification in plants. In addition, users can upload their own data for identification, which facilitates the study of plants that have received less attention.

Training Setup
Once the features were selected, the models were tuned to find the best parameters, and the results were validated using five-fold cross-validation. A benchmark dataset of 4000 randomly selected sequences from each species was constructed for training and validation. To ensure the validity of the experiments, we repeated the above procedure 100 times. As shown in Figure 1A, the highest accuracy across the 100 runs was 95.32% and the lowest was 94.52%, with most values between 94.6% and 94.9%, i.e., very small fluctuations. For further confirmation, we averaged the accuracy over every five runs, as shown in Figure 1B, and the resulting curve fluctuates even less. These results show that the randomly selected data are representative of the entire dataset. Therefore, we took 4000 randomly selected sequences from each species as our baseline dataset.
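The five-fold cross-validation procedure above can be sketched as follows (a minimal, stdlib-only Python illustration; the function names are ours and not part of the PINC codebase):

```python
import random

def five_fold_indices(n_samples, seed=42):
    """Shuffle sample indices and split them into five folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    fold_size = n_samples // 5
    folds = [idx[i * fold_size:(i + 1) * fold_size] for i in range(5)]
    # Distribute any remainder across the first folds
    for j, extra in enumerate(idx[5 * fold_size:]):
        folds[j].append(extra)
    return folds

def cross_validate(n_samples):
    """Yield (train_indices, validation_indices) for each of the five folds."""
    folds = five_fold_indices(n_samples)
    for k in range(5):
        val = folds[k]
        train = [i for j, f in enumerate(folds) if j != k for i in f]
        yield train, val
```

With n_samples = 4000, each split holds 800 validation sequences and 3200 training sequences, and the five validation folds together cover the whole dataset.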

Performance Comparison of the Feature Selection Methods
In this research, the 91 features were filtered using four feature selection methods: the F-test, a variance threshold, RF, and the variance threshold combined with the F-test (VT-F). These methods were compared in order to assess their usefulness. Each method uses learning curves to progressively reduce the number of available features and to select the most appropriate ones. The maximum validation set accuracy was 94.77 percent when the first 31 features were chosen using F-test filtering, and 94.29 percent when the first 25 features were chosen using variance threshold filtering. For VT-F, features with variance below the mean were first filtered out using the variance threshold, and the remaining features were then filtered using the F-test, with a maximum accuracy of 95.25 percent when the first 10 features were selected. The evaluation of these three feature selection methods was based on the AutoGluon model. For RF, the range of features was narrowed down based on feature importance, with the highest accuracy of 93.27 percent when the first 21 features were selected; these 21 features were then fed into AutoGluon, yielding an accuracy of 94.72 percent. In addition to accuracy, we compared the SE, F1, MCC, and SPC performance metrics. Table 1 demonstrates that the method combining the F-test and variance threshold outperformed the other methods; the 10 features it selected were GC content, Score, cdsStop, cdsSize, and the T, C, GT, GC, ACG, and TAT frequencies. We analyzed the distribution of ncRNAs and coding RNAs in the dataset for these 10 features, and Figure 2 shows that they have substantial discriminatory power. In addition, we conducted a correlation analysis between the ten features selected for the classification task. Figure 3 shows that GC content had only a weak correlation with the other features.
Score, cdsStop, and cdsSize showed a stronger correlation with the other features. T, C, GT, GC, ACG, and TAT frequencies had the strongest correlation with the other features.
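The feature-correlation analysis shown in Figure 3 is, at bottom, a matrix of pairwise Pearson correlation coefficients; the underlying computation for one pair of feature vectors can be sketched as follows (our own stdlib-only illustration, not PINC's actual code):

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length feature vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)
```

Computing this coefficient for every pair of the ten selected feature columns yields the symmetric matrix visualized in the correlation heat map.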

Comparison of Models
Regarding the validation set, in this study we compared the five-fold cross-validation results of four AutoML frameworks, AutoGluon, TPOT, H2O, and AutoKeras, with those of three conventional machine learning models, i.e., random forest, SVM, and Naive Bayes (Table 2). It is evident that, in general, conventional machine learning models are less effective than AutoML; three of the four automated machine learning frameworks produced more effective models than random forest, the best-performing conventional machine learning model. Among the AutoML frameworks, AutoGluon achieved the best results for five of the eight evaluation metrics: ACC, F1, MCC, NPV, and SE. H2O achieved the best result for AUC, while AutoKeras achieved the best results for PPV and SPC. The AutoGluon framework is evidently more effective than the other frameworks, possibly because AutoGluon employs per-variable embeddings, which improve quality via gradient flow, whereas the other frameworks merely apply a standard feed-forward architecture to one-hot encoded data. AutoKeras, which is based on NAS combining multiple search strategies such as random search and grid search, performed only marginally worse than AutoGluon. The goal of NAS is to reduce human intervention and to allow the algorithm to design the neural network automatically; it consists of three key components: the search space, the search strategy, and the evaluation strategy. However, this process is typically very time-consuming. H2O had the highest AUC score, but its overall performance was comparable to that of the conventional machine learning models and TPOT. Unlike the other AutoML frameworks, H2O is a distributed machine learning platform based on the Java programming language. TPOT was the least effective AutoML framework and the only one with overall lower results than the conventional machine learning models.
This is likely due to the genetic algorithm employed by TPOT, which tends to converge on a locally optimal solution prematurely. Consequently, the comparison demonstrates that the models created by the AutoGluon framework are superior to those created by the other three automated machine learning frameworks and the three conventional machine learning models.

Table 2. Performance comparisons among the four automated machine learning frameworks and three conventional machine learning models.

Comparison Tools against Plant Datasets
To evaluate the accuracy of PINC in ncRNA and coding RNA identification, we compared it to CPC2, CPAT, CNIT, and CPPred. We compared the identification accuracy of the five tools for nine plant species drawn from four databases: GreeNC, CANTATA, RNAcentral, and Phytozome. It is evident from the results shown in Figure 4 that our tool has the highest accuracy for all nine plants. The large fluctuation of CPPred indicates that it has poor generalization performance, whereas the other three tools show some stability. Nevertheless, the identification accuracy of PINC is greater than that of the other tools, indicating that our tool performs the best across the different plant species. To compare the performance of these five tools further, we used eight metrics, sensitivity (SE), specificity (SPC), accuracy (ACC), F1-score, PPV, NPV, MCC, and AUC, to evaluate the five tools on these nine independent test sets (Table 3). We plotted the ROC curves (Figure 5); the ROC curve for PINC differs from those of the other tools: a true positive rate of 1.0 is reached rapidly, at the cost of a relatively high false positive rate. Therefore, we also plotted PR curves (Figure 6) to further illustrate the performance of PINC. The results showed that the PR curve of PINC did not fluctuate markedly and had a decreasing trend when the threshold was greater than 0.8. Meanwhile, the PR curves illustrated that the Precision and Recall values for five plants (Cicer arietinum, Manihot esculenta, Nymphaea colorata, Sorghum bicolor, and Zea mays) were higher than those of the other tools at the same threshold. All these results showed that PINC had superior performance for distinguishing ncRNAs from coding RNAs. On Solanum tuberosum, PINC outperformed the other tools in seven of the eight evaluation metrics, and on the remaining eight test sets it led in at least five of them, namely SE, ACC, F1, NPV, and MCC.
The high SE score indicates that the probability of missing ncRNAs is small; therefore, PINC is the best choice for ncRNA identification. For the specificity (SPC) score, PINC was higher than the other tools on only one dataset, with four datasets performing best on CNIT and two each performing best on CPC2 and CPAT. However, the difference between the SPC of PINC and that of the other tools was not large, and all tools performed well, above 86.99%. Among the five tools, PINC was the most effective for ncRNA identification in the nine plants. This indicates that our tool generalizes well across plants, which is crucial for non-model plants.

Discussion
In the field of bioinformatics, automated machine learning methods are now beginning to be adopted. In our experiments, we compared four automated machine learning frameworks, covering both recently introduced and older frameworks. For all of them, we used the same preprocessing methods to prepare the raw input data, then adjusted the parameters of each framework to find the most suitable settings, and finally output the model. In general, we treated the automated machine learning frameworks as black boxes and did not examine the framework-specific methods for automatically optimizing parameters and integrating the models for direct output. Automated machine learning frameworks optimize models automatically, thereby reducing the time and effort required of researchers and, to a certain extent, allowing non-experts in machine learning to solve bioinformatics problems.
Utilizing high-quality features is one way to improve performance in machine learning. Because providing or discovering good features is one of the most important tasks in machine learning, it was necessary to find features suitable for ncRNA identification in this study. We extracted k-mer frequency features, coding sequence features, and other features during our experiments. Traditional k-mer features have been used in a variety of studies, such as gene identification [33], subcellular localization [34], and sequence analysis, and the k-mer frequency has been demonstrated to be highly effective at detecting ncRNAs [35]. Many tools have also used features related to coding sequences, as well as some other features [36]. The 91 extracted features were filtered using our feature selection method; the filtered features successfully identified ncRNAs, making PINC the most accurate tool to date for ncRNA identification across the plant species tested.
For ncRNA identification, there are additional factors to consider, such as the trade-off between sensitivity and specificity. At present, the number of identified ncRNAs is small compared with the number of identified coding RNAs. To prevent ncRNAs from being missed, high sensitivity is important. Currently, CPAT, CNIT, CPPred, and CPC2 are less sensitive and focus more on identifying coding RNAs, which requires an additional step to screen for non-coding RNAs. In contrast, the high sensitivity of PINC reduces the need for additional filtering. Moreover, PINC demonstrated higher accuracy than any other tool among the nine plants evaluated. Although some tools for non-coding RNA identification have exceeded 85 percent accuracy, further gains are far from meaningless: large amounts of data have become available due to advances in sequencing technology, and for every one percent increase in accuracy, hundreds of additional RNAs may be correctly identified. Here, PINC achieves a high degree of ncRNA identification precision. This may be because the model in PINC adopts a stacking strategy, while other tools use single models such as SVM, logistic regression, and XGBoost. Combining the predictions of multiple models has long performed better than a single model, with significantly reduced variance [37]. In the experiment, we selected the default base models in the AutoGluon framework. Here, the base models are trained separately, and their predictions are then used as features to train the stacked model. Stacked models can compensate for the shortcomings of single-model prediction and can exploit the interactions among base models to improve predictive ability [38]. In addition, as seen in the feature-level distribution maps described earlier, these features also have strong discriminative ability.
In the future, we plan to continue research in two areas: first, deep learning, which can automatically extract features, reduce the time required for feature extraction, and improve the accuracy of cross-species recognition; second, applying machine learning techniques to gain a deeper understanding of these RNA types and to investigate their biological significance. Moreover, for plants, only a handful of ncRNA functions have been identified; once more functions are identified, new mechanisms can be explored and new features can be added to PINC to improve our tool further.

The overall workflow was as follows. To create the dataset, RNA sequences were obtained from the GreeNC, CANTATA, RNAcentral, and Phytozome databases. Secondly, feature selection methods were used to extract and filter features. Finally, machine learning models were compared to determine the most effective model for ncRNA identification.

Dataset Construction
To construct the experimental dataset, we considered two factors. On the one hand, the diversity of plants and the abundance of annotation data were taken into consideration. On the other hand, considering the balance of the data, we chose four plants for our training and validation datasets (Table 4), which included two model plants, i.e., Arabidopsis thaliana and Oryza sativa, and two non-model plants, i.e., Glycine max and Vitis vinifera. We used ncRNAs as the positive samples and coding RNAs as the negative samples. Negative samples were obtained from Phytozome.v13 [39]. Positive samples were obtained from three public databases: GreeNC [40], CANTATA [41], and RNAcentral [42]. For all data, we first used cd-hit-est-2d from the CD-HIT tool [43] to eliminate redundant sequences between the test and training sets at a threshold of 80% [22,35,44,45]. Second, to balance the datasets, 4000 sequences were randomly selected for each plant, of which 2000 were positive samples and 2000 were negative samples. The positive samples consisted of 1800 lncRNAs and 200 sncRNAs, and the negative samples consisted of 2000 mRNAs (Table 5) [18,46]. Thus, the baseline dataset consisted of a total of 16,000 RNA sequences from four plants. Meanwhile, we analyzed the length distribution of the positive and negative datasets, as shown in Figure 8. The median length of the coding RNA data was 1029, with most sequences in the range of 0-2000, while the ncRNA data had a median length of 321, with most sequences in the range of 0-1000. Finally, we divided the dataset into 70% training data and 30% validation data. Additionally, nine independent test sets were created for nine plants (Table 6): Cicer arietinum, Gossypium darwinii, Lactuca sativa, Manihot esculenta, Musa acuminata, Nymphaea colorata, Solanum tuberosum, Sorghum bicolor, and Zea mays.
To eliminate redundant sequences, the data for these nine independent test sets were taken from the four databases mentioned above and filtered at a threshold of 80%.
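The balanced sampling step described above can be illustrated with a short sketch (the function name and arguments are hypothetical, chosen for this illustration; PINC's actual scripts may differ):

```python
import random

def build_balanced_set(positives, negatives, n_per_class=2000, seed=0):
    """Randomly draw an equal number of positive (ncRNA) and negative
    (coding RNA) sequences, then shuffle the combined, labeled set."""
    rng = random.Random(seed)
    pos = rng.sample(positives, n_per_class)   # sampling without replacement
    neg = rng.sample(negatives, n_per_class)
    data = [(s, 1) for s in pos] + [(s, 0) for s in neg]
    rng.shuffle(data)
    return data
```

Applying this once per species, with n_per_class = 2000, yields the 4000-sequence, class-balanced dataset per plant described above.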

Feature Extraction and Selection
This experiment initially extracted 91 features (Table 7). The 86 features comprising the k-mer frequencies, sequence length, and GC content were obtained using a Python script (https://github.com/midisec/PINC, accessed on 22 August 2022); the five Score and CDS features were obtained using the txCdsPredict program from the UCSC Genome Browser (http://hgdownload.soe.ucsc.edu/admin/jksrc.zip, accessed on 11 November 2014) [47]. These features can be classified into three categories: k-mer frequency features, CDS-related features, and other features. The k-mer frequency describes the frequency of every possible run of k nucleotides in a sequence, based on methods initially implemented in whole-genome shotgun assemblers. When k = 1, each position holds one of four nucleotides: A, C, G, or T. When k = 2, the calculation involves the dinucleotide frequencies (i.e., AA, AT, AG, AC, ..., TT), for a total of 4^2 = 16 combinations. When k = 3, the trinucleotide frequencies (i.e., AAA, AAT, AAG, AAC, ..., TTT) are computed, for a total of 4^3 = 64 combinations. Combining the 1-3-mer frequencies gives a total of 84 features; according to some research, k-mer frequencies can capture rich statistical information about nucleotide profiles in plant genomes [48]. The CDS corresponds to the encoded protein and is interchangeable with the ORF in some respects, though the two differ slightly [49]. The features Score, cdsStart, cdsStop, cdsSize, and cdsPercent comprise the second major category. Score is the predicted protein score; if it is >800, there is a 90% chance that the transcript encodes a protein, and if it is >1000, it is virtually certain. cdsStart is the start of the coding region in the transcript, cdsStop is its end, cdsSize is cdsStop minus cdsStart, and cdsPercent is the ratio of cdsSize to the total sequence length. Other features include sequence length and GC content, which are widely used for ncRNA identification.
Sequence length indicates the total length of the sequence. GC content is the proportion of guanine and cytosine bases among all the bases in the sequence.
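The 84 k-mer frequency features described above (4 + 16 + 64 for k = 1-3) can be computed with a short, stdlib-only script along these lines (a sketch consistent with the description, not necessarily the exact PINC implementation):

```python
from itertools import product

def kmer_frequencies(seq, k_values=(1, 2, 3)):
    """Compute normalized k-mer frequencies for k = 1..3 (4 + 16 + 64 = 84 features)."""
    seq = seq.upper()
    features = {}
    for k in k_values:
        total = max(len(seq) - k + 1, 1)        # number of k-mer windows
        counts = {''.join(p): 0 for p in product('ACGT', repeat=k)}
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if kmer in counts:                  # skip windows containing N, etc.
                counts[kmer] += 1
        for kmer, c in counts.items():
            features[kmer] = c / total
    return features
```

For example, kmer_frequencies("ACGT") returns 84 values, with each mononucleotide at frequency 0.25 and each of the three observed dinucleotides at 1/3.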

PINC
There may be redundant features among the 91 features listed above; therefore, we employed feature selection to filter them. Redundant features were filtered out using a combination of variance threshold filtering and the F-test. Variance threshold filtering removes features based on their own variance: the smaller a feature's variance, the less it varies across samples, and such uninformative features are eliminated. The F-test is a method for determining the relationship between each feature and the label. Of the 91 features, GC content, Score, cdsStop, cdsSize, and the T, C, GT, GC, ACG, and TAT frequencies were selected by this combined feature selection method. Finally, these 10 features were used as the model input.
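The VT-F procedure can be illustrated with a stdlib-only sketch: features whose variance falls below the mean variance are dropped, and the survivors are ranked by a two-class one-way ANOVA F-statistic (our own simplified illustration; PINC's actual implementation may differ):

```python
def variance(xs):
    """Population variance of a feature column."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def f_statistic(values, labels):
    """One-way ANOVA F-statistic for one feature with two classes (0/1).
    Assumes non-degenerate within-class variance."""
    g0 = [v for v, y in zip(values, labels) if y == 0]
    g1 = [v for v, y in zip(values, labels) if y == 1]
    m0, m1 = sum(g0) / len(g0), sum(g1) / len(g1)
    m = sum(values) / len(values)
    ss_between = len(g0) * (m0 - m) ** 2 + len(g1) * (m1 - m) ** 2   # df = 1
    ss_within = sum((v - m0) ** 2 for v in g0) + sum((v - m1) ** 2 for v in g1)
    return ss_between / (ss_within / (len(values) - 2))

def vt_f_select(X, labels, n_keep):
    """VT-F selection: drop features whose variance is below the mean
    variance, then keep the n_keep features with the highest F-statistic."""
    n_feat = len(X[0])
    cols = [[row[j] for row in X] for j in range(n_feat)]
    variances = [variance(c) for c in cols]
    mean_var = sum(variances) / n_feat
    kept = [j for j in range(n_feat) if variances[j] >= mean_var]
    kept.sort(key=lambda j: f_statistic(cols[j], labels), reverse=True)
    return kept[:n_keep]
```

With n_keep = 10 and the 91-column feature matrix, this returns the indices of the ten retained features.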

Model Construction
Machine learning (ML) is currently utilized in a variety of fields to solve numerous difficult problems. Nevertheless, building machine learning models requires human intervention: feature extraction, model selection, and parameter tuning all require professionals to optimize, and can waste a significant amount of time and resources if errors occur. To reduce these repetitive development costs, the concept of automating the entire machine learning process, automated machine learning (AutoML), was conceived. AutoML is defined as the combination of automation and ML [50]. From an automation standpoint, AutoML can be viewed as the design of a framework that automates the entire machine learning process, allowing models to learn the correct parameters and configurations without manual intervention. From a machine learning standpoint, AutoML is a system that is highly capable of learning from and generalizing over given data and tasks. Recent research on AutoML has focused on the neural architecture search (NAS) method, which employs a search strategy to test and evaluate a large number of architectures in a search space, and then selects the one that best meets the objectives of a given problem by maximizing an adaptation function. However, NAS faces two obstacles: first, the amount of computation is excessive, resulting in high resource consumption; second, instability, as the searched structure may change from run to run, resulting in varying precision. In our experiments, we compared four automated machine learning frameworks, AutoGluon, H2O, TPOT, and AutoKeras, with three conventional machine learning models, SVM, RF, and Naive Bayes. We determined that AutoGluon was the superior framework, and therefore it was used as the classifier.
AutoGluon contains 26 base models, including random forest, XGBoost, and a neural network, and in our experiments we used all the base models for training [51]. AutoGluon is an open-source machine learning training framework for tabular data. It attempts to avoid hyperparameter search as much as possible, training multiple models concurrently and combining them with a multi-layer stacking strategy to obtain the final output.
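The multi-layer stacking idea can be illustrated with a toy two-level stack: base models are fitted first, and their predictions become the input features of a meta model. This is a deliberately simplified, stdlib-only sketch of the concept, with made-up toy learners; it is not AutoGluon's implementation, which additionally uses out-of-fold predictions to avoid leakage:

```python
class ThresholdModel:
    """Toy base learner: thresholds one feature at the midpoint of the class means."""
    def __init__(self, feat):
        self.feat = feat
    def fit(self, X, y):
        pos = [r[self.feat] for r, t in zip(X, y) if t == 1]
        neg = [r[self.feat] for r, t in zip(X, y) if t == 0]
        self.mid = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
        return self
    def predict_proba(self, X):
        return [1.0 if r[self.feat] > self.mid else 0.0 for r in X]

class MajorityMeta:
    """Toy meta learner: averages the base models' predicted probabilities."""
    def fit(self, X, y):
        return self
    def predict_proba(self, X):
        return [sum(r) / len(r) for r in X]

def fit_stack(base_models, meta_model, X, y):
    """Train each base model, then train the meta model on the base
    models' predictions used as new features."""
    preds = []
    for m in base_models:
        m.fit(X, y)
        preds.append(m.predict_proba(X))
    meta_model.fit(list(zip(*preds)), y)   # one row of base predictions per sample
    return meta_model

def stack_predict(base_models, meta_model, X):
    preds = [m.predict_proba(X) for m in base_models]
    return meta_model.predict_proba(list(zip(*preds)))
```

The meta model sees only the base models' outputs, so it can learn to weight strong learners more heavily, which is the mechanism behind the variance reduction discussed above.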

Performance Evaluation
Several widely used performance metrics were evaluated in the experiments, including sensitivity (SE), specificity (SPC), accuracy (ACC), F1-score, positive predictive value (PPV), negative predictive value (NPV), and the Matthews correlation coefficient (MCC). To evaluate the performance of the classifier numerically and visually, the area under the curve (AUC) and ROC curves were also used. These definitions are as follows, where TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively:

SE = TP/(TP + FN)
SPC = TN/(TN + FP)
ACC = (TP + TN)/(TP + TN + FP + FN)
PPV = TP/(TP + FP)
NPV = TN/(TN + FN)
F1 = 2 × PPV × SE/(PPV + SE)
MCC = (TP × TN − FP × FN)/sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))
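Given the four confusion-matrix counts, the threshold metrics above can be computed directly (a straightforward illustration of the standard formulas; the function name is ours):

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Compute the standard confusion-matrix metrics from raw counts."""
    se  = tp / (tp + fn)                       # sensitivity (recall)
    spc = tn / (tn + fp)                       # specificity
    acc = (tp + tn) / (tp + tn + fp + fn)      # accuracy
    ppv = tp / (tp + fp)                       # positive predictive value (precision)
    npv = tn / (tn + fn)                       # negative predictive value
    f1  = 2 * ppv * se / (ppv + se)            # harmonic mean of PPV and SE
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {"SE": se, "SPC": spc, "ACC": acc, "PPV": ppv,
            "NPV": npv, "F1": f1, "MCC": mcc}
```

AUC is computed separately from the ROC curve, by integrating the true positive rate over the false positive rate across all thresholds.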

Conclusions
Various tools have been developed to distinguish between ncRNAs and coding RNAs, the majority of which have used computational methods to differentiate sequences and to accelerate the annotation of various human genes. In addition to the nucleotide frequencies with high discriminatory power among the 1-3-mers, we also extracted other features describing the sequence's length, composition, and coding potential. Moreover, we combined F-test and variance threshold filtering and found that the combined method was superior to either method used individually. A number of automated machine learning and traditional machine learning frameworks were also used for modeling and were carefully evaluated and analyzed on the validation set, including via cross-validation, with AutoGluon performing the best. We then compiled these steps into a tool called PINC and compared it with four other tools on nine test sets, demonstrating that PINC performed better than the other tools on all of these species. For user convenience, a user-friendly web server (http://www.pncrna.com/, accessed on 22 August 2022) has been developed, where the output can be obtained simply by entering a FASTA sequence or file. Overall, PINC has excellent predictive properties, permits cross-species plant identification, and is a practical and user-friendly tool.