PreLnc: An Accurate Tool for Predicting lncRNAs Based on Multiple Features

Accumulating evidence indicates that long non-coding RNAs (lncRNAs) have certain similarities with messenger RNAs (mRNAs) and are associated with numerous important biological processes, thereby demanding methods to distinguish them. Based on machine learning algorithms, a variety of methods have been developed to identify lncRNAs, providing basic data support for subsequent studies. However, many tools lack scalability, versatility and balance, and some rely on genome sequence and annotation. In this paper, we propose a convenient and accurate tool, “PreLnc”, which uses high-confidence lncRNA and mRNA transcripts to build prediction models through feature selection and classifiers. The false discovery rate (FDR) adjusted p-value and Z-value were used to analyze the tri-nucleotide composition of transcripts of different species. The experiments show significant differences in RNA transcripts among plants, which may be related to evolutionary conservation and the fact that plants have been under evolutionary pressure for a longer time than animals. Combining the Pearson correlation coefficient with the incremental feature selection (IFS) method and a comparison of multiple classifiers, we built the model. Finally, the balanced random forest was used to construct the classifier, and PreLnc obtained 91.09% accuracy for 349,186 transcripts of animals and plants. In addition, on standard performance measurements, PreLnc performed better than other prediction tools.


Introduction
Long non-coding RNAs (lncRNAs), defined as transcripts longer than 200 nucleotides with low protein-coding potential, were initially considered transcriptional "noise" because their expression level and sequence conservation are lower than those of messenger RNAs (mRNAs) [1]. However, in recent years, accumulating evidence has indicated that lncRNAs exist widely in eukaryotes and are essential elements of the transcriptome [2]. For instance, some lncRNAs act on gene expression in cis or in trans [3], and they can regulate gene expression at multiple levels by complementing homologous sequences of RNA or DNA, or by forming molecular frameworks and scaffolds assembled through structural macromolecular complexes [4][5][6]. In the gene expression network, some lncRNAs act as important regulators of nuclear structure and transcription, mRNA stability, translation and cytoplasmic post-translational modifications [7]. It is precisely because of the various specific expression patterns of lncRNAs in organisms that the annotation of their sequences has fallen behind. Therefore, it is of great biological significance to identify lncRNAs from multitudinous sequences.
Transcriptome analysis shows that more than 90% of the eukaryotic genome is transcribed, but only 1% to 2% of the genome can encode proteins, which indicates that it is crucial to evaluate the protein-coding ability of transcripts [8,9]. Similar to mRNAs, lncRNAs have a poly(A) tail and promoter structure after splicing [10]. Since lncRNAs and mRNAs have certain similarities and both have biological functions [11], the identification and differentiation of lncRNAs and mRNAs deserve further study. Moreover, previous studies have shown that some highly expressed lncRNAs are poorly conserved, while some lowly expressed lncRNAs have important functions [12]. Therefore, it is less feasible to seek regularity directly from the stability of lncRNAs, their conservation between species, or their expression level [12,13].
With the continuous development of sequencing technology and a large number of species being sequenced, researchers can predict lncRNAs computationally from abundant candidate transcripts. Since distinguishing lncRNAs from mRNAs among transcripts is a binary classification problem, machine learning algorithms have been applied in many methods, such as CPAT [14], PLEK [15], CPC2 [16], lncRScan-SVM [17], LncFinder [18], PLncPRO [19], RNAplonc [20], etc. CPAT is a tool for evaluating protein-coding ability based on a logistic regression model, and it is as popular and species-neutral as CPC2. CPC2 uses a support vector machine (SVM) model with the standard radial basis function kernel, which can rapidly predict the protein-coding ability of sequences, and it provides an online website. PLEK adopts an improved k-mer scheme and a support vector machine algorithm to separate lncRNAs from mRNAs. In contrast to the sequence-based methods described above, lncRScan-SVM relies on GTF files of query transcripts and focuses on predicting lncRNAs. LncFinder identifies lncRNAs using intrinsic sequence composition, structural information, and physicochemical properties, and it is released as an R package and a web server. PLncPRO and RNAplonc are tools for predicting plant lncRNAs, providing identification models for monocotyledonous and dicotyledonous plants. An RNA-seq experiment can generate thousands of transcripts, and it is difficult and time consuming for researchers to filter thousands of lncRNA predictions. Predictive software allows researchers to select the most likely candidate sequences for experimental validation, which helps them perform functional verification of lncRNAs more efficiently. However, the general applicability and convenience of these methods still need to be improved.
Many methods rely on various genetic identification databases and other sequence alignment files, leading to a complex prediction process that takes up a lot of time and space. In this paper, we propose a convenient and accurate tool, "PreLnc", which uses high-confidence lncRNA and mRNA transcripts to build prediction models through feature selection and classifiers. The false discovery rate (FDR) adjusted p-value and Z-value were used to analyze the tri-nucleotide composition of transcripts of different species. After ranking the features by the Pearson correlation coefficient, we use the incremental feature selection (IFS) method and a comparison of multiple classifiers to build the model. Finally, the balanced random forest was used to construct the classifier, so that the trained PreLnc model can effectively deal with the imbalance between lncRNAs and mRNAs. Besides, on standard performance measurements, PreLnc performed better than other prediction tools in many aspects and obtained 91.09% accuracy for 349,186 transcripts of animals and plants. Its open-source package is available at https://www.github.com/LeiCao97/PreLnc.

Datasets
The datasets in this paper are mainly from six species, which can be divided into two major categories, one consisting of three animals (humans, mice, and cows), and the other consisting of three common plants (Arabidopsis thaliana, Oryza sativa, and Zea mays). For the animal datasets, we rigorously filtered long non-coding and protein-coding transcripts with 'transcript_biotype' as 'lncRNA' and 'protein_coding', respectively, from the Ensembl (v97) database (http://asia.ensembl.org/) [21]. Because known lncRNAs are absent for some plants, long non-coding transcripts of plants were obtained from the GreeNC database, which is a repository of lncRNAs annotated specifically in plants and algae [22]. We selected protein-coding transcripts as negative samples from the EnsemblPlants (v44) database [23]. In order to establish the model effectively, we used CD-HIT [24] with the parameters c = 0.9 and aS = 0.9 to filter out 19,181 highly similar lncRNAs and 75,673 highly similar mRNAs. For model training and testing, we further divided the datasets into training sets and testing sets, which were randomly sampled from the complete datasets (see Table 1). All the datasets were independent of each other and are available online at https://github.com/LeiCao97/PreLncData.

Feature Selection and Extraction
Given that the features play a crucial role in the prediction, we selected a variety of features based on the definition, structure and composition of lncRNAs and mRNAs, which can be directly obtained by scientific calculations. Since each selected feature affects the prediction performance, we selected 11 important features as the first candidate subset through analyzing existing lncRNAs prediction methods, which can be divided into four categories (see Table 2). First, sequence length, GC content and standard deviation (SD) of stop codon counts (SCC) (see Equations (1) and (2)) are the basic features directly derived from transcripts [16].
where x represents stop codon counts of three frames (SCC 1 , SCC 2 , SCC 3 ). Second, the structural features are composed of open reading frame (ORF) integrity and some parameters related to the coding sequence (CDS) (CDS length, CDS score and CDS percentage) [17,25], which can be evaluated by txCdsPredict program from UCSC [26]. Theoretical isoelectric point (PI) and length of a predicted peptide, which are related to peptides and calculated by the "ProtParam" module in BioPython [27], were selected as features of the third category. The fourth category is composed of two simple and functional definition features that distinguish protein-coding and non-coding transcripts. Fickett TESTCODE score was used to evaluate combination effects based on nucleotide composition and codon usage bias (see Equations (3)-(8)) [28], whereas hexamer score was selected to evaluate the dependencies between adjacent amino acids (see Equation (9)) [14].
$N_1 = \text{number of As in positions } 1, 4, 7, \ldots$ (4)
$N_2 = \text{number of As in positions } 2, 5, 8, \ldots$ (5)
$N_3 = \text{number of As in positions } 3, 6, 9, \ldots$ (6)
In Equation (3), N content means the proportion of nucleotide $N \in \{A, T, C, G\}$. In Equations (4) to (7), shown here for N = A, $N_1$, $N_2$, $N_3$ measure the asymmetry in the distribution of each base among the three codon positions, and $N_{pos}$ measures the deviation of each base from one codon to another. Hexamer score can be calculated by the following equation.
where $F(h_i)$ and $F'(h_i)$ ($i = 0, 1, \ldots, 4095$) represent the in-frame hexamer frequencies in the coding and non-coding training data sets, respectively; the score is measured as a log-likelihood ratio between the two. In RNA molecules, each group of three adjacent nucleotides is defined as a codon, representing an amino acid during protein synthesis [29]. Considering that classification performance remains a major concern, we proposed to add a subset of features for animals and plants, respectively, by taking tri-nucleotides as another candidate feature set, since they perform well in distinguishing lncRNAs and mRNAs [30,31]. It is worth mentioning that transcripts differ between plants and animals, and even between individual species. Therefore, it is meaningful to analyze and select uniformly valid features among the 64 tri-nucleotides to enhance classification performance. Based on the proportion of each tri-nucleotide in the transcript, we conducted a statistical significance test [32].
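The codon-position counts of Equations (4)-(6) and the hexamer log-likelihood ratio can be sketched as follows. This is a minimal illustration, not the PreLnc or CPAT source: the function names are hypothetical, the ratio max/(min + 1) is assumed from Fickett's original TESTCODE definition, and the per-sequence mean of log ratios, the step of three nucleotides, and the pseudocount are assumptions made so that unseen hexamers do not produce a division by zero.

```python
import math
from collections import Counter

def position_counts(seq, base):
    """Counts of `base` at codon positions 1, 2 and 3 (N_1, N_2, N_3)."""
    seq = seq.upper()
    return [seq[offset::3].count(base) for offset in range(3)]

def position_asymmetry(seq, base):
    """Assumed Fickett ratio: max(N_1, N_2, N_3) / (min(N_1, N_2, N_3) + 1)."""
    counts = position_counts(seq, base)
    return max(counts) / (min(counts) + 1)

def in_frame_hexamers(seq):
    """Hexamers read in frame 0, advancing one codon (3 nt) at a time."""
    seq = seq.upper()
    return [seq[i:i + 6] for i in range(0, len(seq) - 5, 3)]

def hexamer_table(seqs):
    """In-frame hexamer counts over a set of training sequences."""
    return Counter(h for s in seqs for h in in_frame_hexamers(s))

def frequency(table, hexamer, pseudocount=1.0):
    """Smoothed frequency of one hexamer (4096 possible hexamers)."""
    total = sum(table.values()) + pseudocount * 4096
    return (table.get(hexamer, 0) + pseudocount) / total

def hexamer_score(seq, coding, noncoding):
    """Mean log-likelihood ratio of coding vs. non-coding hexamer frequency."""
    hexamers = in_frame_hexamers(seq)
    if not hexamers:
        return 0.0
    ratios = (math.log(frequency(coding, h) / frequency(noncoding, h))
              for h in hexamers)
    return sum(ratios) / len(hexamers)
```

A positive score indicates hexamer usage closer to the coding training set, and a negative score indicates usage closer to the non-coding set.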
The p-value is the probability, under the null hypothesis, of observing a test statistic at least as extreme as the one obtained; it is compared against a threshold to measure significance (≤5% is significant, ≤1% is highly significant). The false discovery rate is often applied as a multiple-testing correction to the p-value, and the FDR adjusted p-value is a more rigorous measure of significance [33]. The corresponding Z-value intuitively reflects the degree of difference and is a good measure of the correlation between the k-mers within a sequence and the structural or even functional category [34]. The adjusted p-value and Z-value were applied to analyze the patterns of tri-nucleotides in lncRNAs and mRNAs of different species (see Equations (10)-(13)) [35], and to identify the tri-nucleotides that are consistent among animals and among plants.
$\text{Adjusted } P_{value}(i) = P_{value}(i) \times \frac{\text{length}(P)}{\text{rank}(P)}$ (13)
Here, $\bar{X}_1$ and $\bar{X}_2$ represent the averages of a certain tri-nucleotide in the positive and negative sample sets, $S_1$ and $S_2$ represent the standard deviations, and $n_1$ and $n_2$ represent the numbers of lncRNAs and mRNAs. In Equation (12), the p-value is the probability that the test statistic ($Z_C$) is greater than or equal to the value calculated from the observed sample set ($|Z|$) when the test data is lncRNA ($\mu = \mu_0$). In Equation (13), the p-values are sorted from small to large, and the adjusted p-value is obtained by multiplying the current p-value ($P_{value}(i)$) by the total number length(P) and dividing by its rank, rank(P). In order to select the tri-nucleotides with the strongest influence as features, we set the FDR adjusted p-value threshold at 1% and ranked the tri-nucleotides by the absolute value of the Z-value.
The feature selection method usually has a great influence on the whole research process; thus, Pearson correlation coefficients (PCCs) [36] and the incremental feature selection method [37] were used to study the correlation between features and transcript types. First, we use the Pearson correlation coefficient to rank the features and remove redundant features (|r| > 0.8) to ensure their independence from each other. Then, after min-max scaling of all feature parameters from the training sets, the incremental feature selection method was applied to the ranked features with each classifier, and the classification was evaluated with 10-fold cross-validation [37,38]. Moreover, we analyzed the recognition bias of these features for mRNAs of different species, such as the amino acids corresponding to tri-nucleotides that may have an important effect on protein synthesis.
Feature extraction was mainly conducted through Python scripts modified from three existing tools (CPAT [14] for Feature 3, CPC2 [16] for Features 4-7 and lncRScan-SVM [17] for Features 8-11).
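The significance analysis of one tri-nucleotide can be sketched as below. This is an illustrative reading rather than the authors' script: the two-sample Z statistic is assumed from the symbol definitions (means, standard deviations and sample sizes), the p-value is assumed to be two-sided, and the running minimum that keeps adjusted p-values monotone comes from the standard Benjamini-Hochberg procedure, not from the text. All function names are hypothetical.

```python
import math

def z_statistic(mean1, sd1, n1, mean2, sd2, n2):
    """Two-sample Z statistic for the difference of two means (assumed form)."""
    return (mean1 - mean2) / math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)

def two_sided_p(z):
    """P(|Z_C| >= |z|) under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))

def fdr_adjust(p_values):
    """Adjusted p = p * length(P) / rank(P), made monotone from the largest rank down."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # ascending p-values
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

With 64 tri-nucleotides, `fdr_adjust` would receive 64 p-values per species, and those with an adjusted value of at most 0.01 would then be ranked by the absolute Z-value.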
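As a rough illustration of how the basic sequence features could be computed from a FASTA sequence, the following sketch derives GC content and the standard deviation of stop codon counts over the three forward reading frames. The function names are hypothetical, and the use of the population standard deviation is an assumption, since the equations for this feature are not reproduced here.

```python
from statistics import pstdev

STOP_CODONS = {"TAA", "TAG", "TGA"}

def gc_content(seq):
    """Fraction of G and C bases in the sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0

def stop_codon_counts(seq):
    """Stop codon counts in the three forward reading frames (SCC1, SCC2, SCC3)."""
    seq = seq.upper().replace("U", "T")
    counts = []
    for frame in range(3):
        codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
        counts.append(sum(codon in STOP_CODONS for codon in codons))
    return counts

def scc_sd(seq):
    """Standard deviation of the three counts (population SD is an assumption)."""
    return pstdev(stop_codon_counts(seq))
```

A coding transcript tends to have one frame with very few stop codons, so the three counts differ more and the standard deviation is larger than for a typical non-coding transcript.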

Model Foundation and Evaluation
Feature selection and the choice of classifier both affect the final prediction performance. Therefore, we combined the incremental feature selection method with a variety of machine learning classifiers, including logistic regression (LR) [39], SVM [40], decision tree (DT) [41], random forests (RF) [42], and K-nearest neighbor methods (KNN) [43]. Based on Python's 'sklearn' package, parameters of the 10-fold cross-validation, including sensitivity (SEN) and F-measure (F) (see Equations (14) and (15)), were evaluated to build a model with good prediction performance and high applicability. Taking into account the imbalance of the training sets, we sampled the positive and negative samples of the data set in different proportions, and finally selected the balanced random forest as the classifier. In the overall modeling process, all features were extracted from the transcripts of each species for individual training, and the lncRNA prediction model was created (see Figure 1). All input files are in FASTA format and do not rely on other gene annotation files.
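The two evaluation parameters can be sketched as below. Since Equations (14) and (15) are not reproduced here, the standard definitions are assumed: sensitivity is TP / (TP + FN), and the F-measure is the harmonic mean of precision and sensitivity, with lncRNAs as the positive class. The function names are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn) for a binary labelling."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn

def sensitivity(tp, fn):
    """Fraction of true lncRNAs that are predicted as lncRNAs."""
    return tp / (tp + fn) if tp + fn else 0.0

def f_measure(tp, fp, fn):
    """Harmonic mean of precision and sensitivity."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = sensitivity(tp, fn)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

In an incremental feature selection loop, these values would be computed for each fold and averaged for every candidate feature subset.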
Genes 2020, 11, x FOR PEER REVIEW

Analysis of Tri-Nucleotide Differences Among Species
We used Z-value stacking histograms to show the tri-nucleotide differences among species and the effect of tri-nucleotides on protein-coding ability (see Figures 2 and 3). The Z-value fragment of the same trinucleotide shows the significant degree of differentiating transcripts between species. The difference in the Z-value stacking value of different trinucleotides shows the difference in the transcript sequence itself, which implies that the composition of lncRNA in plants is significantly different. In Figure 2, the columnar segmentation shows the range span of Z-value on humans, mice, and cows (0.183~34.656, 0.079~17.809 and 0.08~12.512), indicating that the significance of tri-nucleotides on humans is greater than that on mice and cows. However, Figure 3 shows that the Z-value span is more differentiated among A. thaliana, O. sativa and Z. mays (0.027~28.755, 0.021~69.513 and 0.152~85.182). Accordingly, the significance of the tri-nucleotides on Z. mays was greater than that on the other two plants.
[Figure 2. Axis labels: cumulative absolute value of Z-value; tri-nucleotide. Series: human, mouse, cow.]
A few tri-nucleotides extracted from the two datasets were filtered out because their adjusted p-values do not meet the significance condition, even though their cumulative absolute Z-values are relatively high. Consequently, 15 tri-nucleotides were used as another feature subset for animals, and 24 tri-nucleotides were used for plants. As can be seen in Figures 2 and 3, the 15 red-labeled tri-nucleotides selected for animals are ACG, AGC, CAG, CAT, CCA, CGG, CGT, GAC, GAG, GAT, GGC, GGG, TAC, TAG, and TCA, and the 24 red-labeled tri-nucleotides selected for plants are AAA, AAC, AAT, ACC, ACT, AGC, AGG, ATC, ATG, CAA, CAG, CAT, CGA, CTA, GAC, GAG, GAT, GGA, GGG, GTA, TAA, TAG, TAT, and TTA, as they affect the prediction significantly more than the other tri-nucleotides. Detailed parameters of the adjusted p-value and Z-value can be downloaded for viewing (Additional File 1: Table S1).

Correlation Analysis and Ranking Lists of Features
The Pearson correlation coefficient can not only measure the correlation between individual features and transcript classes, but also help eliminate redundant features. Therefore, we present the correlation coefficients of all features in the form of heat maps (see Figures 4 and 5).
As can be seen from Figures 4 and 5, for both animals and plants, four features, namely, standard deviation of stop codon counts, CDS length, CDS score, and peptide length, are highly positively correlated with each other (|r| > 0.8). According to the definition and calculation of these highly correlated features, we filter out CDS length and peptide length to ensure independence between features. The standard deviation of the stop codon counts, which measures the deviation among the three frames in the ORF, is retained, since not all ORFs can be expressed as protein products.
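The ranking and redundancy filter can be sketched as follows. This is a simplified illustration with hypothetical function names: it ranks features by the absolute Pearson correlation with the class label and greedily drops any feature whose |r| with an already retained feature exceeds 0.8. The greedy rule is an assumption; in the analysis above, the choice of which correlated feature to keep was made from the definition and calculation of the features. Min-max scaling is included because it precedes incremental feature selection.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y) if var_x and var_y else 0.0

def rank_and_prune(features, labels, threshold=0.8):
    """Rank feature names by |r| with the label, then drop redundant ones."""
    ranked = sorted(features,
                    key=lambda name: abs(pearson(features[name], labels)),
                    reverse=True)
    kept = []
    for name in ranked:
        if all(abs(pearson(features[name], features[k])) <= threshold
               for k in kept):
            kept.append(name)
    return kept

def min_max_scale(values):
    """Scale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]
```

Here `features` maps a feature name to its values across all training transcripts, and `labels` holds 1 for lncRNA and 0 for mRNA.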
According to the importance of individual features in distinguishing lncRNAs from mRNAs, we ranked the features to prepare for feature selection (see Figures 6 and 7). On the whole, the features distinguish the transcript classes more easily in animals than in plants. Although the sorted list cannot intuitively indicate the impact of features on predictive classification, it is certain that the predictive model of animals will be more versatile than that of plants.

Results of Incremental Feature Selection Method with Multiple Classifiers
To ensure that the model can better predict lncRNAs and mRNAs, we integrated the results of feature selection and classifiers on animals and plants to unify the final standards of the model. Figure 8 shows the dynamic change of the F parameter in combination with features and multiple classifiers. In general, a random forest was selected as the final classifier because its performance on the six species is significantly better than that of the other classifiers. As far as animals are concerned, the F parameter of the top 20 features for humans (0.91503) is slightly lower than that of the top 19 features (0.91738); considering the consistency between species, the top 20 features with high F parameters (0.91503 for humans, 0.92555 for mice and 0.94723 for cows) were finally used to build the animal prediction model (see Figure 8A-C). Compared with animals, random forests have a good classification effect on plants, and the F parameter is generally higher than 0.99 (see Figure 8D-F). The results of the random forest algorithm with ranked features demonstrated good stability on plants; thus, we retain all features to further strengthen the model's balance and generalization ability. The detailed data of the SEN and F parameters can be downloaded for viewing (Additional Files 2 and 3: Tables S2 and S3).
In most plant lncRNA databases, lncRNA transcript data of many species have yet to be mined, resulting in an imbalance between the numbers of lncRNAs and mRNAs. Considering that users can build prediction models for any species using the PreLnc tool, balanced random forests were applied for automatic feature selection [44,45]. We divided the positive and negative samples of the training set into different proportions for comparison (see Figure 9). Balanced random forests perform poorly on uneven animal datasets, with F-measure gaps of 15.93% for humans, 13.18% for mice, and 14.52% for cows. In contrast, the F-measure gaps of plants are smaller (6.77% for A. thaliana, 6.40% for O. sativa, and 3.53% for Z. mays). Therefore, given the common scarcity of known lncRNAs in plants, our method shows considerable advantages.
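The sampling idea behind a balanced random forest can be sketched as below: each tree is grown on a bootstrap sample that contains the same number of transcripts from every class. This is a generic illustration of the technique with a hypothetical function name; whether PreLnc draws its samples in exactly this way is an assumption.

```python
import random

def balanced_bootstrap(labels, rng):
    """Indices of one bootstrap sample with equal counts per class."""
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    # Every class contributes as many draws as the smallest class has members.
    n_minority = min(len(indices) for indices in by_class.values())
    sample = []
    for indices in by_class.values():
        sample.extend(rng.choices(indices, k=n_minority))  # with replacement
    return sample

# Example: 10 lncRNAs (label 1) against 90 mRNAs (label 0).
labels = [1] * 10 + [0] * 90
sample = balanced_bootstrap(labels, random.Random(0))
```

Because the majority class is undersampled independently for each tree, the ensemble still sees most mRNAs overall while no single tree is dominated by them.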

Comparison and Analysis of Prediction Results
In this paper, the performance of PreLnc is evaluated by comparing standard measurement parameters with some common prediction tools, including CPAT, PLEK, CPC2, and LncFinder, as well as some plant tools, including PLncPRO and RNAplonc. We mainly analyzed the balance and bias of the tool's prediction performance with SEN, SPE, ACC, MCC and AUC scores. The AUC scores of LncFinder were not listed because it directly determines the type of transcripts. MCC scores were lower, due to the large data difference between the positive and negative sample sets, but it did not affect the comparison between the various tools.
As a result, PreLnc had a significant advantage in many aspects, especially for SEN, ACC, and MCC (see Table 3). PreLnc has the highest prediction performances, especially for humans, mice, A. thaliana and Z. mays, but it lacks in prediction performances for cows and O. sativa. Additionally, ROC curves were used to represent the AUC scores and authenticity of these methods, showing the great performance of PreLnc (see Figure 10). PreLnc obtained 91.09% accuracy for 349,186 transcripts of animals and plants. On the whole, PreLnc has a higher recognition rate and maintains a balance in distinguishing lncRNAs and mRNAs.
In Table 3, the numbers in bold are the highest parameters of the prediction results. CPAT, CPC2, PreLnc, PLEK and LncFinder predicted lncRNA in all species, and RNAplonc and PLncPRO specifically predicted plant lncRNA. PreLnc has obvious recognition advantages in humans, mice, A. thaliana and Z. mays, and the AUC value maintains a high score in identifying lncRNAs and mRNAs of 6 species. LncFinder had the best predictive results on cows and O. sativa.
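For reference, the threshold-based measures used in this comparison can be computed from the confusion counts as sketched below. The standard definitions of specificity, accuracy and the Matthews correlation coefficient are assumed, and the function names are hypothetical.

```python
import math

def specificity(tn, fp):
    """Fraction of true mRNAs that are predicted as mRNAs."""
    return tn / (tn + fp) if tn + fp else 0.0

def accuracy(tp, tn, fp, fn):
    """Fraction of all transcripts that are classified correctly."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient, in the range [-1, 1]."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denominator if denominator else 0.0
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, so it stays informative when one class is much larger than the other.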

Prediction on Other Known Datasets
Non-coding regulatory RNA is one of the hotspots in life science research, and a large number of lncRNA data and resource databases have been accumulated to facilitate query and research. To verify the effectiveness of our tool, we further compared the prediction results on other known datasets. We first compared the human (lncRNAs: 6142, mRNAs: 7485), mouse (lncRNAs: 10638, mRNAs: 6460), and A. thaliana (lncRNAs: 2562, mRNAs: 13986) datasets from CPC2, which include mRNAs from the RefSeq database with protein sequences annotated by Swiss-Prot and non-coding transcripts from the Ensembl (v87) and EnsemblPlants (v32) databases. PreLnc performed well on the human and A. thaliana datasets, but its predictive effect on mice was slightly lacking (see Table 4). CPAT obtained higher SEN scores on mice and A. thaliana, while LncFinder had an advantage in prediction accuracy on mice.
The second dataset consisted of non-coding transcripts of humans and mice from NONCODEv5 [46]. According to the prediction results, PreLnc had the higher prediction accuracy for 172,216 human ncRNA transcripts, reaching 95.319% (see Table 5). CPAT was relatively more accurate (97.594%) in predicting ncRNA transcripts of mice. In general, the prediction results are highly consistent with the above experiments.

Analysis of Model Universality and Generalization Ability
In the above comparative experiments, our algorithm achieved an overall prediction accuracy of 91.09% on model organisms. In order to further verify the universality and generalization ability of the model, we evaluated and analyzed the prediction of lncRNAs from other species, including Aedes aegypti, Rhesus, Opossum, Platypus, and Pig. The 4274 lncRNA transcripts of A. aegypti were derived from the systematic research project on the Nucleotide Sequence Database (NT) of NCBI, while the data for the other species were derived from NONCODE v5 [46]. The A. aegypti lncRNA research covered 117 RNA-seq libraries, and sequencing reads with an average quality score (Phred Score) above 20 were retained for downstream analysis [47]. For these less-studied organisms, the model trained on humans was applied. In Figure 11, the results show that the prediction accuracies for Rhesus, Opossum and Pig are all greater than 93%, and those for A. aegypti and Platypus are both greater than 91%. The prediction results on other species verify the effectiveness of PreLnc for predicting novel lncRNAs and its generalization ability.

Comparison of Time Consumption
We compared the computing time of several tools on the same platform, configured as follows: Linux, Ubuntu 16.04.6 LTS 64 bit, Intel® Xeon® CPU E5-2682 v4 @ 2.50 GHz and 2 GB memory. The computational time on the three CPC2 data sets is listed in Table 6. CPC2, CPAT, RNAplonc and PLncPRO all run quickly. In contrast, PreLnc takes nearly 0.5 times longer than LncFinder, while PLEK takes slightly longer than PreLnc. Meanwhile, the time difference between the human and mouse data sets indicates that PLEK may have a certain preference for sequence length.

Discussion
Many problems in lncRNA prediction remain to be solved. Given the diversity of species, methods with higher compatibility need to be developed without sacrificing prediction accuracy. The specificity and relevance of multiple species can be explored with effective computational methods to extract more biologically significant features.
In this paper, we use a cross-species method to identify lncRNAs and study the intrinsic tri-nucleotide differentiation between animal and plant transcripts. Based on the adjusted p-value and Z-value analysis (see Figures 2 and 3), we found that the Z-value span of tri-nucleotides varies more among plants than among animals, which is consistent with evolutionary conservation and the fact that plants have been under evolutionary pressure for a longer time than animals [48,49]. As codons, every three adjacent nucleotides in an RNA molecule are grouped together to specify a certain amino acid during protein synthesis [29]. Therefore, for mRNA transcripts, it is meaningful to analyze the amino acids corresponding to the significantly differentiated tri-nucleotides [29,30]. Across the above six species, the prominent amino acids are threonine, serine, glutamine, histidine, arginine, aspartic acid, glutamic acid, glycine and tyrosine, together with the stop codon (UAG). In addition, proline, the amino acid corresponding to one further tri-nucleotide, is more important for mRNA transcripts of animals. In plants, by contrast, lysine, asparagine, isoleucine, leucine, valine, methionine (start codon AUG) and the stop codon (UAA) play a major role in protein synthesis.
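The tri-nucleotide comparison described above can be illustrated with a minimal sketch. It assumes that tri-nucleotide composition is counted with an overlapping sliding window and that the Z-value is a standard two-sample z statistic between two groups of per-transcript frequencies; the exact definitions used by PreLnc are not given in this section, and the function names are hypothetical.

```python
from itertools import product
from math import sqrt
from statistics import mean, variance

# All 64 tri-nucleotides over the RNA alphabet.
TRINUCLEOTIDES = ["".join(p) for p in product("ACGU", repeat=3)]


def trinucleotide_frequencies(seq):
    """Relative frequency of each overlapping tri-nucleotide in one transcript."""
    seq = seq.upper().replace("T", "U")  # accept DNA-style input as well
    counts = dict.fromkeys(TRINUCLEOTIDES, 0)
    total = 0
    for i in range(len(seq) - 2):
        kmer = seq[i:i + 3]
        if kmer in counts:  # skip windows containing ambiguous bases such as N
            counts[kmer] += 1
            total += 1
    return {k: (c / total if total else 0.0) for k, c in counts.items()}


def z_value(group_a, group_b):
    """Two-sample z statistic for the difference of two group means.

    Assumption: this is a generic statistic, not necessarily the one used
    in the paper. Each group needs at least two values.
    """
    se = sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
    return (mean(group_a) - mean(group_b)) / se if se else 0.0
```

In use, one would collect the frequency of a given tri-nucleotide (for example `"AUG"`) over all mRNA transcripts and over all lncRNA transcripts of a species, and pass the two lists to `z_value`; the p-values of many such tests would then still need an FDR adjustment.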
During iterative incremental feature selection, the classification effect of individual features on each species is also obtained. For animals, the prediction score of the CDS percentage (Feature: CDS_pencent) clearly exceeds 0.8 except with the KNN classifier (see Figure 8A-C). For plants, the transcript length (Feature: Length) is highly effective for classifying transcripts: the F parameter exceeds 0.70 on A. thaliana, and exceeds 0.81 on O. sativa and Z. mays. Meanwhile, the CDS score (Feature: CDS_score) significantly enhances the classification ability on all species. Therefore, CDS-related features are more conducive to distinguishing lncRNA from mRNA, as they reflect whether a transcript has a protein-coding function [50].
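To make the CDS percentage feature concrete, the following is a minimal sketch. It assumes that the CDS is approximated by the longest open reading frame (from AUG to the first in-frame stop codon) on the forward strand, and that the feature is that length divided by the transcript length. How PreLnc actually derives this feature is not stated in this section, and the function names are hypothetical.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}


def longest_orf_length(seq):
    """Length in nucleotides of the longest forward-strand ORF, stop codon included."""
    seq = seq.upper().replace("T", "U")  # accept DNA-style input as well
    best = 0
    for frame in range(3):
        start = None  # position of the currently open AUG, if any
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "AUG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - start)
                start = None  # look for the next ORF in this frame
    return best


def cds_percent(seq):
    """Fraction of the transcript covered by its longest ORF (assumed definition)."""
    return longest_orf_length(seq) / len(seq) if seq else 0.0
```

Under this definition a typical mRNA yields a value well above that of a typical lncRNA, which agrees with the observation that CDS-related features separate the two classes well.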
The differences in lncRNA prediction between plants and animals appear to result from imbalance and lack of training data. Judging from the training results on imbalanced datasets (see Figure 9), prediction models of animals are more dependent on the balance of positive and negative data samples. This may be because current research on animal lncRNA is more comprehensive and covers a wider range of mechanisms, so that a lack of lncRNA data greatly reduces the accuracy of prediction [51,52]. In contrast, few plant lncRNAs have been identified, and much of this research field remains unexplored [51][52][53]. Therefore, the imbalance of the plant dataset has little effect on the prediction accuracy.
Comparing the various methods reveals the advantages and biases of these tools in distinguishing RNA transcripts. As seen from the testing results (see Tables 3 and 4), CPAT and CPC2 can quickly predict the coding ability of transcripts, but the accuracy of CPAT is higher. The poor prediction performance of CPC2 may be due to its being trained only on human RNA transcripts, whereas the transcripts of different species are highly complex and inconsistent. In addition, LncFinder achieved higher SPE, ACC and MCC scores, especially on cows and O. sativa, and its overall classification effect is better than that of CPAT, CPC2 and PLEK. Among the plant-specific tools, the poor prediction of PLncPRO may be due to its consensus models for dicots and monocots [19]. At the same time, RNAplonc is biased toward identifying lncRNA, especially on O. sativa. PreLnc has the best predictive performance on some species, such as humans and A. thaliana, but not on cows and O. sativa. The lack of species training data and the complexity of the transcripts may affect the PreLnc model. In addition, the prediction scores for animals were lower overall than those for plants, which may be because the lncRNAs and mRNAs of animals are themselves more complex and diverse [52,54]. Combined with the PreLnc prediction results on other species (see Figure 11), the PreLnc model shows good effectiveness, universality and generalization.
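For reference, the performance measures named above (SPE, ACC and MCC) can be computed from a confusion matrix as in the following sketch. These are the standard textbook definitions. The choice of lncRNA as the positive class is an assumption, and the sensitivity (SEN) entry and the function name are added here for illustration only.

```python
from math import sqrt


def classification_metrics(tp, tn, fp, fn):
    """Standard binary classification measures from confusion-matrix counts.

    Assumption: lncRNA is the positive class, so tp counts lncRNAs
    predicted as lncRNAs and tn counts mRNAs predicted as mRNAs.
    """
    total = tp + tn + fp + fn
    sen = tp / (tp + fn) if tp + fn else 0.0  # sensitivity (recall)
    spe = tn / (tn + fp) if tn + fp else 0.0  # specificity
    acc = (tp + tn) / total if total else 0.0  # accuracy
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0  # Matthews correlation
    return {"SEN": sen, "SPE": spe, "ACC": acc, "MCC": mcc}
```

Unlike ACC, MCC stays informative when the numbers of lncRNA and mRNA transcripts differ greatly, which is why it is useful alongside accuracy when tools are compared on imbalanced test sets.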
Taken together, the results of this study suggest that lncRNA is less conserved among different species, and that transcript sequences are difficult to use as a reference across species [55,56]. Plant lncRNA research is still in its infancy and has great research value, as it may reveal unknown mechanisms that control plant growth and differentiation [53,57]. When studying and analyzing differences in the overall RNA transcripts of animals, a more comprehensive and systematic data structure should be adopted to address the impact of high differentiation. Moreover, further studies can compare transcript sequences from the perspectives of functional verification and mechanism studies.

Conclusions
A variety of tools have been developed for lncRNA prediction, most of which use computational methods to predict sequences and thereby accelerate the annotation of unknown genomes for the Human Genome Project. In this work, in addition to screening the tri-nucleotides of species with high genetic similarity as features, other features, such as sequence definition, composition and function, were extracted to propose a more effective method. Compared with other tools, PreLnc's features can be obtained directly from the transcripts, and the tool has a degree of extensibility, universality and fault tolerance. PreLnc is a convenient and user-friendly tool, but its computing speed still needs to be improved. On the whole, PreLnc has good predictive performance and supports the prediction of lncRNAs for multiple species.