Classification of Precursor MicroRNAs from Different Species Based on K-mer Distance Features

Abstract: MicroRNAs (miRNAs) are short RNA sequences that are actively involved in gene regulation. These regulators on the post-transcriptional level have been discovered in virtually all eukaryotic organisms. Additionally, miRNAs seem to exist in viruses and might also be produced in microbial pathogens. Initially, transcribed RNA is cleaved by Drosha, producing precursor miRNAs. We have previously shown that it is possible to distinguish between microRNA precursors of different clades by representing the sequences in a k-mer feature space. The k-mer representation considers the frequency of a k-mer in the given sequence. We further hypothesized that the relationship between k-mers (e.g., distance between k-mers) could be useful for classification. Three different distance-based features were created, tested, and compared. The three feature sets were entitled inter k-mer distance, k-mer location distance, and k-mer first-last distance. Here, we show that classification performance above 80% (depending on the evolutionary distance) is possible with a combination of distance-based and regular k-mer features. With these novel features, classification at closer evolutionary distances is better than using k-mers alone. Combining the features leads to accurate classification for larger evolutionary distances. For example, categorizing Homo sapiens versus Brassicaceae leads to an accuracy of 93%. When considering average accuracy, the novel distance-based features lead to an overall increase in effectiveness. On the contrary, secondary-structure-based features did not lead to any effective separation among clades in this study. With this line of research, we support the differentiation between true and false miRNAs detected from next-generation sequencing data, provide an additional viewpoint for confirming miRNAs when the species of origin is known, and open up a new strategy for analyzing miRNA evolution.


Background
Dysregulation of gene expression is a hallmark of many diseases. MicroRNAs (miRNAs) are also expressed, and their differential regulation has been explored in many studies aimed at finding disease markers. MicroRNAs are post-transcriptional regulators that modify protein expression. Their production starts with expression as primary miRNAs (pri-miRNAs) followed by processing by Drosha in the nucleus [1]. This processing leads to precursor miRNAs (pre-miRNAs) that are exported into the cytosol. Mature miRNAs are single-stranded RNA sequences of 18-24 nt in length, which are incorporated as the recognition element into RISC. They are produced from precursor miRNAs via Dicer processing. Once incorporated into RISC, the loaded miRISC complex can act on its target messenger RNAs, with the mature miRNA providing targeting specificity.
Although microRNAs have been found in large parts of the phylogenetic tree, the molecular pathways of plants and animals may have evolved independently [2]. However, both pathways share the general processing from pri-miRNA to mature miRNA with the production of pre-miRNA and the final incorporation into a protein complex involved in translational silencing. While plants are eukaryotes and can be expected to have an miRNA pathway [3], finding miRNAs in viruses may be surprising, but it is efficient for them to encode a small RNA with a potentially large effect [4]. It is important to note that miRNAs are only functional if they are coexpressed with their targets [5]. When both miRNA and target mRNA are present, modulation of the targets' protein expression can occur [6]. MicroRNAs are not expressed at all times and in all tissues, and some may only be expressed in response to cellular stresses. Therefore, it is not possible to determine all miRNAs, their targets, and their interactions experimentally. Hence, computational approaches for miRNA prediction are important, and many such approaches are available [7][8][9]. A large part of the tools are based on machine learning (ML). With a few notable exceptions [10][11][12], two-class classification is the basis for ML-based miRNA prediction tools. Although it has been found that the complete procedure is essential for ML model establishment [13], one important factor is that negative data used in model establishment comes without any quality guarantee. Positive data, while also containing questionable examples [14][15][16], is generally of higher quality because positive interactions can be queried in the lab. In contrast, it would be difficult to do so for negative examples. MicroRNAs and their targets are collected in databases. Examples of such databases are miRTarBase [17], TarBase [18], and MirGeneDB [15], which generally depend on miRBase [19], which is the primary collection of all miRNAs.
To be able to apply machine learning, it is essential to represent pre-miRNAs in a vector space. Therefore, feature extraction and the types of features used are crucial for model performance [20]. An abundance of features for encoding miRNAs and their secondary structure have been proposed [21]. Since miRNA genesis is a multistep process involving several protein complexes, the structural features of pre-miRNAs are essential [22]. Krol et al. [23] evaluated some miRNA features experimentally using biochemical methods to discover precursor structures and compared their findings to predictive approaches. They found some differences between methods, which were more pronounced for the overall predicted precursor structure but did not strongly affect the stability of its termini.
Almost all published features have been implemented by Saçar Demirci et al. [20] and miRNAfe [24]. Features can be differentiated into sequence-based, thermodynamic, probabilistic, structural, or a mixture thereof. All features can further be normalized, e.g., using another feature. Typically, features such as stem length or the number of stems are used for normalization. The tool izMiR implements all ML-based approaches and especially dissects the various feature sets that are used in ML-based miRNA prediction [20].
K-mers are short nucleotide sequences with the length k, and they have been used for ML-based ab initio detection of pre-miRNAs from the onset [25]. We were interested in finding out whether the pre-miRNA sequence (not considering the secondary structure) contains enough discriminating power to categorize miRNAs among species. It could be hypothesized that there may be sequence-based recognition via the protein machinery of the miRNA pathway. Additionally, we took into account the phylogenetic relationship among species to investigate whether there is a discernable difference allowing the separation of the miRNAs of various species.
We set out to answer these questions while removing the uncertainty of the quality of the negative data. For this, we employed the pre-miRNAs of one species/clade as positive data and the pre-miRNAs of another species/clade as negative data [26]. Hence, we only used positive data for two-class classification. For ML with the selected data, we used the random forest algorithm. We found that species that are phylogenetically distantly related could be distinguished by ML models established in this way. In another study [27], we employed information-theoretic features to investigate their ability to categorize miRNAs among species/clades. However, we found that the usage of information-theoretic features did not outperform k-mers for this type of analysis. In this work, novel features based on k-mers, more specifically the distance among k-mers within a sequence, were used. The results were compared to k-mer performance and other previously published features. The average performance of the k-mer distance features was a bit higher than k-mers alone (~1%). On the other hand, they were somewhat less effective than selected features from all categories (~0.6% on average). We did not see any increase in performance when combining the k-mer distance-based feature set with the simple k-mer features. The novel k-mer distance features were, however, better at categorizing miRNAs at closer evolutionary distances. In conclusion, k-mer distance features can be useful for future studies aimed at categorizing miRNAs into their species of origin, especially for closer evolutionary distances. Categorizing miRNAs to their species of origin can aid in contaminant detection when predicting miRNAs from next-generation sequencing data. It further adds another line of evidence for miRNAs predicted from genomes and finally may present a different vantage point for the analysis of miRNA evolution.
Algorithms 2021, 14, 132
It is our aim for the future to create an automated system that can categorize miRNAs into their clade/species of origin.

Feature Space of Precursor miRNA
One important step in applying machine learning is the step of feature extraction and representation. In our data, the examples are given as a sequence of nucleotides where each sequence is a combination of four letters: A, T, C, and G. One possible representation of each sequence in a vector space is (freq of A, freq of T, freq of C, and freq of G), where freq is the frequency of the letter in the given sequence. However, this kind of representation was not successful for reaching high performance. In general, one would convert each miRNA sequence into a vector $v = (v_1, v_2, \ldots, v_n)$, where each $v_i$ corresponds to a specific feature. For precursor miRNA, different studies have considered k-mer representation. Recently, we have shown that k-mers are sufficient to allow categorizing of pre-miRNAs into species [9].

K-mer Features
Sequence-based features are commonly used for ML-based precursor miRNA analysis. Sequence-based features include patterns that can be derived from miRNAs or short words. Sequences typically consist of the nucleotides {A, U, C, G}. Short sequences with a particular length k can be referred to as k-mers or n-grams. For instance, 1-mers are the four possible "words" A, U, C, and G. Similarly, 2-mers consist of two adjacent nucleotides forming the words AA, AC, ..., UU. Here, we used at most k = 3, which implies 64 short nucleotide sequences ranging from AAA to UUU. The number of k-mers up to and including k can be established by the following formula: $\sum_{i=1}^{k} 4^{i}$, which gives 4 + 16 + 64 = 84 for k = 3. Although higher values for k have been explored [28], our preliminary tests showed that higher values of k did not add significant improvements but led to a dramatic increase of the feature space. Therefore, we chose 1-, 2-, and 3-mers as features for this work. The k-mer counts were not used directly, but their frequency was established via normalization by the length of the sequence (i.e., len(sequence) − k + 1). In total, 84 features were calculated per miRNA given k = {1, 2, 3}. The k-mer frequency ranges between 0 and 1 (0 if the k-mer is not present in the sequence and 1 if the sequence is a repeat of a mononucleotide). The latter is unlikely to be judged as a miRNA as it would not form a secondary structure.
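The k-mer frequency features described above can be sketched as follows. This is a minimal illustration, not the implementation used in the study: the function name `kmer_frequencies`, the conversion of T to U, and the skipping of windows with ambiguous symbols are assumptions made for the example.

```python
from itertools import product

ALPHABET = "ACGU"  # RNA alphabet as used in the text


def kmer_frequencies(sequence, max_k=3):
    """Return {k-mer: frequency} for every k-mer with 1 <= k <= max_k."""
    # Assumption: DNA-style input is mapped to RNA so that both spellings work.
    sequence = sequence.upper().replace("T", "U")
    features = {}
    for k in range(1, max_k + 1):
        windows = len(sequence) - k + 1  # normalization term from the text
        counts = {"".join(p): 0 for p in product(ALPHABET, repeat=k)}
        for i in range(max(windows, 0)):
            kmer = sequence[i:i + k]
            if kmer in counts:  # assumption: skip windows with symbols such as N
                counts[kmer] += 1
        for kmer, count in counts.items():
            features[kmer] = count / windows if windows > 0 else 0.0
    return features
```

Calling `kmer_frequencies` on any sequence yields 84 values for `max_k=3`, one per 1-, 2-, and 3-mer, each between 0 and 1.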

K-mer Distance Features
The new set of suggested features is based on the position of the k-mers within the pre-miRNA sequences. To capture the location of the k-mers and their relationship to other k-mers within the same sequence, we created three approaches. For each approach, we generated 84 features corresponding to one of the 84 k-mer features with k ≥ 1 and k ≤ 3.

Inter K-mer Distance
K-mers are distributed over the miRNA sequence, and the distance between k-mers may be important. Therefore, we designed a feature that calculates the distance between the first occurrence and last occurrence of each k-mer within the examples. The overall score is then the sum of these distances, which is further normalized by the sequence length (Figure 1).

Figure 1. The description of the algorithm calculating the inter k-mer distance features from the pre-miRNA sequence.

K-mer First-Last Distance
While the inter k-mer distance considers the distance between the first and last occurrence of a k-mer, this feature measures the distance between the first and last occurrence of each k-mer directly. The measure is normalized using sequence length (S: sequence, dfl: distance first last): dfl = (last location of k-mer in S − first location of k-mer in S)/len(S)
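The dfl formula can be illustrated with a short sketch. The function name `first_last_distance` is hypothetical, and returning 0 for a k-mer that does not occur is an assumption, since the text does not specify that case.

```python
def first_last_distance(sequence, kmer):
    """dfl = (last location - first location) / len(sequence)."""
    first = sequence.find(kmer)  # start index of the first occurrence, -1 if absent
    if first == -1:
        return 0.0  # assumption: the text does not define dfl for absent k-mers
    last = sequence.rfind(kmer)  # start index of the last occurrence
    return (last - first) / len(sequence)
```

A k-mer that occurs exactly once has identical first and last locations and therefore a dfl of 0.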

K-mer Location Distance
For each k-mer, all locations within the pre-miRNA sequence are recorded (loci in Figure 2). Then, the average distance among all locations is calculated (dl = dl/|loci|). In case a k-mer does not occur in the sequence, the location distance is set to −1. In case there is only one occurrence of the k-mer in the sequence, the location distance is set to 0.
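One possible reading of the location distance is sketched below. Because the algorithm figure is not reproduced here, two details are assumptions: the distances are summed over consecutive occurrences, and the sum is divided by the number of locations, following dl = dl/|loci|. The function names are hypothetical.

```python
def kmer_locations(sequence, kmer):
    """Start positions of all (possibly overlapping) occurrences of kmer."""
    k = len(kmer)
    return [i for i in range(len(sequence) - k + 1) if sequence[i:i + k] == kmer]


def location_distance(sequence, kmer):
    """Average distance among the locations of kmer in sequence."""
    loci = kmer_locations(sequence, kmer)
    if not loci:
        return -1.0  # k-mer absent, as stated in the text
    if len(loci) == 1:
        return 0.0  # single occurrence, as stated in the text
    # Assumption: sum the gaps between consecutive occurrences.
    total = sum(b - a for a, b in zip(loci, loci[1:]))
    return total / len(loci)  # dl = dl / |loci|
```

The value −1 keeps absent k-mers distinguishable from k-mers that occur exactly once, which score 0.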

Secondary Features
Similar to what is described in [9], secondary-structure-based features were calculated for the miRNA examples. The features that were extracted were the number of loops in the structure, the number of base pairs in the stem, and the number of bulges. The number of loops was also extracted for a set size from 1 to 6. All loops greater than 6 were combined in one measure. If a loop has an even number it is symmetric, otherwise it is asymmetric.
We created a KNIME workflow [29] to extract those features. We did not calculate the secondary structure but used the one provided by miRBase [30].

Other Features Describing Pre-miRNAs
For the parameterization of pre-miRNAs, an abundance of features has been published [21]. These parameters can be categorized into sequence features such as k-mers [31], structural features such as the number of bulges [32], thermodynamic ones such as minimum free energy [33], and combinations like the triplet feature [34] consisting of one nucleotide and its adjacent secondary structure. Via normalizing or transforming the features, for example, into p-values [35], additional features are created, adding to the total number of parameters. The set of parameters used in different studies has influenced the prediction success, and most approaches have recently been compared [20]. All previous studies have used known pre-miRNAs as positive data. These studies used pseudo negative data to train ML classifiers. The significant contribution of this study is that no pseudo negative data were used in training ML classifiers. Instead, known miRNAs from one species were employed as positive data, while the negative data were the known miRNAs of another species. For this scenario, we have shown that it is effective to use k-mers or sequence motifs [26,36]. Information theory-based transformations of features [27], while effective by themselves, add little to the classification accuracy. Here, we introduced three new sets of features (see above) and compared the previously described features for pre-miRNA detection. Out of the more than a thousand features available for this purpose (including normalized ones), we finally selected 100 in this study. The selection was based on the correlation among features (the lower the better) and the information gain (higher is better) of the features.

Datasets and Methods
Preprocessing the Data
Data from 15 clades were collected for this study (Table 1). The 16th example in Table 1 contains the sequences of Homo sapiens that were extracted from its clade Hominidae. Some miRNAs are highly conserved and exist in many copies throughout the data. This could create biased models, so we removed highly similar sequences, leaving just one representative. All data from all clades and Homo sapiens were combined and clustered using USEARCH [37]. From the resulting clusters, we chose one representative. This effectively created a dataset consisting of only nonhomologous sequences. Clades were re-established from the filtered dataset, but the homologous sequences between clades were not reintroduced after filtering.

Feature Vector and Feature Selection
Many features to parameterize pre-miRNAs are known, and this study employed 831 features used in a previous study [21]. Additionally, novel k-mer distance features were introduced. For feature selection, we employed information gain (IG) [38,39] as it is implemented in KNIME (version 3.1.2) [29].
Information gain was used for feature selection by evaluating each feature/variable's information content in the context of the target variable. The formula to compute IG is $IG(C, A) = H(C) - H(C \mid A)$, where $H(C) = -\sum_{c \in C} p(c) \log p(c)$ represents the entropy of the class, and $H(C \mid A)$ is the conditional entropy of the class given feature A.
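The information gain of a single feature can be sketched as below. This is an illustration only and not the KNIME implementation: it assumes a discrete (already binned) feature and a base-2 logarithm, and the function names are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy H(C) in bits of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """IG(C, A) = H(C) - H(C|A) for a discrete feature A."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # H(C|A): entropy within each feature value, weighted by its share of examples.
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional
```

Continuous features such as k-mer frequencies would need to be discretized before this function is applied; how that binning is done is a design choice not covered here.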

Classification Approach
Similar to the study of [26], we trained random forest (RF) classifiers [40] using the RF implementation of KNIME [29]. For training and testing, we split the overall data into 80% training and 20% testing data. We used undersampling of the majority class to force positive and negative examples to equal amounts. Cross-validation for model performance estimation was performed using 100-fold Monte Carlo cross-validation (MCCV) [41]. For training and testing, we used the default setting for the RF implementation by KNIME.
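The sampling scheme described above can be sketched in plain Python. This is not the KNIME workflow: the function name `mccv_splits` is hypothetical, and drawing a fresh undersample of the majority class in every repeat is an assumption, since the text does not state whether undersampling happens once or per repeat.

```python
import random


def mccv_splits(labels, n_repeats=100, train_fraction=0.8, seed=0):
    """Yield (train_indices, test_indices) for Monte Carlo cross-validation.

    Labels are 1 (positive clade) or 0 (negative clade). In every repeat the
    majority class is undersampled to the size of the minority class.
    """
    rng = random.Random(seed)
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]
    n_per_class = min(len(positives), len(negatives))
    for _ in range(n_repeats):
        # Balanced sample: equal numbers of positive and negative examples.
        sample = rng.sample(positives, n_per_class) + rng.sample(negatives, n_per_class)
        rng.shuffle(sample)
        cut = int(round(train_fraction * len(sample)))
        yield sample[:cut], sample[cut:]
```

Each yielded pair would be used to train one classifier on the training indices and score it on the test indices; the reported performance is then the average over all repeats.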

Model Performance Evaluation
Model performance was assessed using the Matthews correlation coefficient (MCC) [42], among other measures. Sensitivity, specificity, and accuracy were also used for the evaluation of model performance. In the following, the presented results always refer to the average model performance for 100-fold MCCV.
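The four measures can be computed from the confusion-matrix counts of a single test split, as in the following sketch. The function name is hypothetical, and returning 0 when a denominator is zero is a common convention assumed here rather than something stated in the text.

```python
from math import sqrt


def classification_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, accuracy, and MCC from confusion counts."""
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumption: MCC is reported as 0 when it is undefined (zero denominator).
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }
```

Averaging each of these values over the 100 MCCV repeats gives the kind of summary figure reported in the results.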

Results and Discussion
We have previously shown that k-mers can categorize miRNAs into species [27]. To improve the categorization, we devised three new sets of features also based on k-mers. The data used for categorization derive from clades with sufficient numbers of examples for classification (Table 1). The data represent various evolutionary distances (Figure 1). Previously, we have shown that it is possible to categorize species accurately at larger evolutionary distances, but categorizing becomes more difficult with decreasing evolutionary distance [26]. Therefore, it is important to have examples from varying evolutionary distances, in this case, also from various kingdoms (Figure 3).
Each combination of species and clades needs a specific classifier, and these were trained using 100-fold MCCV. Initially, the known k-mer features were evaluated with respect to categorization accuracy (Figure 3). With one clade used for positive data and the other as negative data, 100 models were established using 80% training and 20% testing data. Classifier performance represents the average performance of 100-fold MCCV. These computations lead to a matrix where both columns and rows represent clades. Considering Aves versus Hexapoda as an example, the average accuracy amounts to 0.87 at 100-fold MCCV. This performance is 5% better than the general average of all established models. All the accuracy values on the diagonal of the table are the results of the clade with itself. This categorization is obviously very difficult, and accordingly, all accuracy values along the diagonal are very low.
With the results in Figure 3, our previous observation regarding the effectiveness of k-mer features [26,27], namely that the average model accuracy also increases with increasing evolutionary distance, can be confirmed. Although we have removed similar sequences between Homo sapiens and its clade Hominidae, the performance is low (0.14 ACC), which indicates that some "hidden message" encoded in perhaps the triplet bias is conserved for the whole Hominidae clade. This observation is even more intriguing as the performance of Embryophyta versus its child clades is much better (0.70-0.78 ACC). Perhaps this is due to the larger evolutionary distance between Embryophyta and its child clades selected here, or the miRNAs may not be as strictly conserved within the plant kingdom.
Considering the closest clades in the lower side of the phylogenetic tree (Figure 1), Cercopithecidae, Homo sapiens, and Rodentia, the performance is also the lowest (63-65 ACC; Table 1). Surprisingly, the performance of Brassicaceae and Malvaceae, which are also very closely related, is much better (74).
K-mers were introduced to parameterize pre-miRNAs early on [25], and many other sequence-based features have followed. Here, we did not use k-mers directly but transformed them to create new parameters to describe pre-miRNAs. Differentiation between pre-miRNAs from different species/clades was tested in the same manner as for k-mers (Figure 4).
The inter k-mer distance was tested first (Figure 4). Like k-mers (Figure 3), roughly three hotspots with high accuracy for differentiation among species/clades can be observed. The most striking result is that virus pre-miRNAs seem to be very different from pre-miRNAs from the plant kingdom but much closer to pre-miRNAs from the metazoan clade. Since most viruses in the list infect animals, this may imply that their pre-miRNAs are functional in the host rather than for virus regulative activities, similar to what we had earlier proposed for Toxoplasma gondii [41,42]. The other hotspots separate kingdoms from each other, which supports the previous observation that the performance of each pair of clades is correlated to the phylogenetic distance between the two clades. In summary, k-mer features alone are on average about 2% better than inter k-mer distance features.
Employing our new k-mer first-last distance (Figure 5) leads to similar results as for k-mers (Figure 3) and inter k-mer distance (Figure 4). The average performance for k-mer first-last distance is equal to k-mers. Interestingly, k-mer first-last distance is slightly better for viruses, Embryophyta, and Laurasiatheria. Therefore, we checked the difference between k-mer and k-mer first-last distance results (Figure 6). It is interesting to see that k-mer first-last distance performance is better for more closely related species. At the same time, k-mers seem to be better at larger evolutionary distances. This difference can be leveraged when a rough categorization has already been achieved with some prior method.
However, combining k-mer and the three k-mer distance feature sets and selecting the top 100 features according to information gain leads to only a slight increase in accuracy over k-mer features alone (~1%).
Figure 5. Results in the figure stem from k-mer first-last distance features only. Average accuracy for 100-fold MCCV using a random forest classifier and a split of 80% training and 20% testing. Yellow shades refer to lower accuracy, while red shades indicate higher average classification accuracy.
Figure 6. Difference between k-mer first-last distance and k-mers only. Yellow shades indicate better performance for k-mer first-last distance, while red shades indicate higher average classification accuracy for k-mers.
For comparison, we used other features, including structural features such as the number of bulges [32], thermodynamic ones such as minimum free energy [33], and combinations such as the triplet feature [34] consisting of one nucleotide and its adjacent folding structure. The same procedure that was used for the establishment of k-mer and k-mer distance models was employed. The results are reported in Figure 7. The average accuracy is 0.83, just like the one for k-mer and k-mer distance features combined. The areas of Figure 7 with high accuracy are also mostly similar to the areas for k-mers (Figure 3) and the combination of k-mer features. Some of the features, especially the probabilistic ones, are computationally expensive and lead to long run times. It can be concluded that their calculation is not warranted when aiming to categorize pre-miRNAs to their species.
Since categorization was very successful using non-sequence-based features, we decided to extract features that focus on structure, similar to Yousef et al., 2006 [9]. The results are presented in Figure 8. While the higher accuracy areas in Figure 8 are similar to the above results, the overall accuracy is about 10% below the accuracy achieved with k-mer or k-mer distance-based features (Figure 9).
In summary (Figure 9), it can be seen that secondary-structure-based features perform worse for the categorization of miRNAs among species/clades. There appears to be little difference in employing all published features, the top 100 selected ones (based on low correlation among features and high information gain), k-mer, or k-mer distance features. There still appears to be sequence information in the selected features. For example, the triplet structure features consist of a nucleotide and the local hybridization pattern: N, where N is any nucleotide, and (non)bonds are represented by dots and parentheses, respectively. Among the novel features, the k-mer location distance performed best. It achieved a comparable accuracy to using k-mers alone (data not shown).

Top Features
We ranked all distance features according to information gain. For this analysis, we merged the distance features "inter k-mer distance" and "k-mer first-last distance" for each pair of clades. Table 2 confirms the expectation that different features are important for the categorization of different pairs of clades. While inter k-mer distance features are most prominent, the particular sequence that is most discriminating varies. The complete analysis, including the ranked features for each pair of clades, can be found in our GitHub repository.
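Ranking by information gain can be sketched as follows. This is a minimal illustration, assuming a binary split of each numeric feature at its median; how the features were discretized in this study is not stated here, and the names `information_gain` and `rank_features` are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels, threshold=None):
    """Entropy reduction from splitting a numeric feature at a threshold.

    Assumption: a single split at the median when no threshold is given.
    """
    if threshold is None:
        ordered = sorted(values)
        threshold = ordered[len(ordered) // 2]
    left = [y for x, y in zip(values, labels) if x < threshold]
    right = [y for x, y in zip(values, labels) if x >= threshold]
    gain = entropy(labels)
    for part in (left, right):
        if part:  # an empty side contributes nothing
            gain -= len(part) / len(labels) * entropy(part)
    return gain


def rank_features(table, labels):
    """table maps feature name -> list of values (one per sequence)."""
    scores = {name: information_gain(col, labels) for name, col in table.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy example: two clades (0 and 1), two made-up distance features.
labels = [0, 0, 1, 1]
table = {"separating": [1, 2, 8, 9], "uninformative": [1, 8, 2, 9]}
ranking = rank_features(table, labels)
```

In the toy data, the first feature separates the two classes perfectly and therefore ranks above the second, whose split leaves both sides evenly mixed.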

Conclusions
MicroRNAs are small noncoding RNA sequences that are involved in post-transcriptional gene regulation, modulating protein abundance. Representatives of these miRNAs are found throughout the tree of life. How miRNAs evolved is currently under investigation [42]. This question is related to our previous research, in which we showed that miRNAs can be categorized according to their species of origin [26,27].
In this study, we designed three transformations of k-mers (k-mer location distance, k-mer first-last distance, and inter k-mer distance) for miRNA parameterization. These features were used in machine learning to differentiate between miRNAs from different species/clades. The classification performance was compared to that achieved with k-mer features and with previously published features. Random forest models were established using 100-fold Monte Carlo cross-validation (MCCV), with one species or clade providing the positive class and another serving as the negative class. To assess the categorization effectiveness into species/clades, examples needed to be selected from a wide range of phylogenetic distances (Table 1, Figure 1). More than 100 models were established per feature set comparison, leading to the establishment and comparison of about 100,000 models for this study.
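The evaluation scheme described above can be sketched as a small MCCV harness. This is only an illustration under stated assumptions: the 80/20 train/test ratio and the function names are hypothetical, and a nearest-centroid classifier stands in for the random forest so that the example needs only the standard library.

```python
import random
from statistics import mean


def mccv_accuracy(pos, neg, fit, predict, folds=100, train_frac=0.8, seed=0):
    """Monte Carlo CV: draw a fresh random train/test split `folds` times.

    pos/neg are lists of feature vectors for the positive and negative
    clade. train_frac=0.8 is an assumption, not a value from this study.
    """
    rng = random.Random(seed)
    data = [(x, 1) for x in pos] + [(x, 0) for x in neg]
    cut = int(len(data) * train_frac)
    accuracies = []
    for _ in range(folds):
        shuffled = data[:]
        rng.shuffle(shuffled)
        train, test = shuffled[:cut], shuffled[cut:]
        model = fit(train)
        hits = sum(predict(model, x) == y for x, y in test)
        accuracies.append(hits / len(test))
    return mean(accuracies)


def fit_centroids(train):
    """Stand-in model (NOT a random forest): one mean vector per class."""
    centroids = {}
    for label in (0, 1):
        rows = [x for x, y in train if y == label]
        if rows:
            centroids[label] = [mean(col) for col in zip(*rows)]
    return centroids


def predict_centroid(model, x):
    """Assign the class whose centroid is closest (squared Euclidean)."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(model, key=lambda label: dist(model[label]))


# Toy feature vectors for two well-separated clades.
pos = [[1.0 + 0.01 * i, 1.0] for i in range(20)]
neg = [[-1.0 - 0.01 * i, -1.0] for i in range(20)]
avg_acc = mccv_accuracy(pos, neg, fit_centroids, predict_centroid)
```

Replacing `fit_centroids` and `predict_centroid` with a random forest learner would bring the sketch closer to the setup described above; the average over all folds corresponds to the average accuracy reported per pair of species/clades.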
As can be expected, more distant species/clades can be categorized more effectively, which can be seen by the clustering of higher average accuracy (Figures 3-5, 7 and 8). The finding that k-mer features can categorize well at larger evolutionary distances confirms our previous results [26,27]. Using only the inter k-mer distance or the k-mer location distance features leads to a performance similar to that of employing only k-mer features. Combining all features and then selecting the top 100 features (based on low correlation among features and high information gain) slightly increases the average accuracy, by 1% (Figure 9). This accuracy is equal to the effectiveness of using all published features. However, calculating all these features is very time-consuming and amounted to thousands of CPU hours for this study. Naturally, these features contain many sequence-based ones. In an attempt to evaluate the contribution of the secondary structure to the categorization of miRNAs, we selected parameters describing the secondary structure of pre-miRNAs (Figure 8). Categorization with secondary-structure-based features is less accurate (~10% on average) compared to the other approaches (Figure 9). This finding is in line with the paradigm that structure is more strongly conserved than sequence. A more in-depth comparison of results showed that the effectiveness of k-mer and k-mer distance features is not equal (Figure 6). K-mer distance features are slightly more effective for categorization at closer evolutionary distances.
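One way to combine "low correlation among features" with "high information gain" is a greedy filter, sketched below. The correlation cutoff of 0.9, the use of Pearson correlation, and the function names are assumptions made for illustration; the selection procedure used in this study may differ.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0  # constant feature: treat as uncorrelated
    return cov / math.sqrt(va * vb)


def select_top_features(table, scores, top_n=100, max_abs_corr=0.9):
    """Greedy selection: best score first, skipping near-duplicates.

    table maps feature name -> values; scores maps feature name -> e.g.
    information gain. max_abs_corr=0.9 is an assumed cutoff.
    """
    chosen = []
    for name in sorted(scores, key=scores.get, reverse=True):
        if all(abs(pearson(table[name], table[kept])) <= max_abs_corr
               for kept in chosen):
            chosen.append(name)
        if len(chosen) == top_n:
            break
    return chosen


# Toy example: "a_copy" is a scaled duplicate of "a" and is skipped.
table = {"a": [1, 2, 3, 4], "a_copy": [2, 4, 6, 8], "b": [1, -1, 1, -1]}
scores = {"a": 0.9, "a_copy": 0.8, "b": 0.5}
selected = select_top_features(table, scores, top_n=2)
```

The filter keeps the highest-scoring member of each group of strongly correlated features, so that the final set of 100 features is not dominated by redundant variants of the same signal.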
In conclusion, k-mer and k-mer distance features together lead to more accurate categorization at larger evolutionary distances. One example is the categorization of miRNAs into Homo sapiens versus Brassicaceae, which achieves an average accuracy of 93%. The novel distance-based transformations of the k-mer features increase the average accuracy at closer evolutionary distances. The use of secondary-structure-based features did not lead to a favorable performance in this study (Figure 9). We hope that these findings will provide a new angle from which to study miRNA evolution. More practically, categorizing miRNAs according to their species of origin can help ensure that predicted miRNAs conform to the expectations for the organism in which they are predicted. Similarly, when predicting miRNAs from NGS data, categorizing the predictions according to their species of origin can reveal whether the predictions stem from contamination. To better support these processes, we aim to implement a fully automatic categorization method in the future.