Transposable elements (TEs) are genomic units able to move within and among the genomes of virtually all organisms [1
]. They are the main contributors to genomic diversity and genome size variations [2
], except for polyploidy events. TEs also perform key genomic functions, being involved in chromosome structuring, gene expression regulation and alteration, adaptation and evolution [3
], and centromere composition in plants [4
]. Currently, an important issue in genome sequence analyses is to rapidly identify and reliably annotate TEs. However, there are major obstacles and challenges in the analysis of these mobile elements [5
], including their repetitive nature, structural polymorphism, species specificity, and high divergence rates, even across closely related species [6].
TEs are traditionally classified according to their replication mode [7
]. Elements using an RNA molecule as an intermediate are called Class I or retrotransposons, while elements using a DNA intermediate are called Class II or transposons [8
]. Each class of TEs is further sub-classified by a hierarchical system into orders, superfamilies, lineages, and families [9].
Several bioinformatic methods have been developed to detect TEs in genome sequences, including homology-based, de novo, structure-based, and comparative genomic approaches, but no combination of them can provide reliable detection in a relatively short time [10
]. Most of the algorithms currently available use a homology-based approach [11
], displaying performance issues when analyzing elements in large plant genomes. In the current scenario of large-scale sequencing initiatives, such as the Earth BioGenome Project [12
], disruptive technologies and innovative algorithms will be necessary for genome analysis in general and, particularly, for the detection and classification of TEs, which represent the main portion of these genomes [13].
In recent years, several databases consisting of thousands of TEs at all classification levels, covering several species and taxa, have been created and published [3
]. Furthermore, these databases have different characteristics, such as containing consensus [14
] or genomic [17
] TE sequences, coding domains [9
], and also TE-related RNA [20
]. These databases have been constructed with the TEs detected in sequenced species using bioinformatics approaches (commonly based on homology or structure), which can produce false positives if there is no curation process [11
]. As with other biological datasets (such as datasets of splice sites [22
], or protein function predictions [23
]), TE databases contain distinct numbers of each TE type, producing unbalanced classes [23
]. For example, in PGSB, the largest proportion of the elements corresponds to retrotransposons (at least 86%) [24
]. This imbalance is caused by the replication mode of each TE class. As in other detection tasks, the negative instances for TE identification are all genomic elements other than TEs (which constitute the positive instances) [25
], such as introns, exons, CDS (coding sequences), and simple repeats, among others, making the negative class multimodal. These databases constitute valuable resources to improve tasks such as TE detection and classification, whether using conventional bioinformatics or novel techniques such as machine learning (ML).
ML is defined as a set of algorithms that can be calibrated based on previously processed data or past experience [28
] and a loss function through an optimization process [29
] to build a model. ML is applied to different bioinformatics problems, including genomics [30
], systems biology, evolution [28
], and metagenomics [31
], demonstrating substantial benefits in terms of precision and speed. Several recent studies using ML to detect TEs report drastic improvements in the results [32
] compared to conventional bioinformatics algorithms [13].
In ML, the selection of adequate metrics to measure an algorithm's performance is one of the most crucial and challenging steps. Commonly used metrics for classification tasks are accuracy, precision, recall, and ROC curves [35
], but they are not appropriate for all datasets [37
], especially when the positive and negative datasets are unbalanced [13
]. Accuracy and ROC curves can be meaningless performance measurements in unbalanced datasets [22
], because they do not reveal the true classification performance of the rare classes [38
]. For example, ROC curves are not commonly used in TE classification, because only a small portion of the genome contains certain TE superfamilies [34
]. On the other hand, precision and recall can be more informative since precision is the percentage of predictions that are correct [34
] and recall is the percentage of true samples that are correctly detected [26
]. Nevertheless, it is recommended to use them in combination with other metrics, since any single metric cannot provide a full picture of the algorithm's performance [36].
Most of the classification and detection tasks addressed by ML define two classes, positive and negative [13
]. Thus, predictions can be classified as true positives (tp) if samples from the positive class are predicted as positive, and as false negatives (fn) if samples from the positive class are predicted as negative. Conversely, samples from the negative class that are predicted as positive constitute false positives (fp), while those correctly predicted as negative are true negatives (tn) [13
]. These counts are arranged in the confusion matrix, and most of the metrics used in ML are calculated from this matrix.
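As an illustration, the confusion-matrix counts and the metrics derived from them can be sketched in a few lines of Python (the labels below are invented for illustration; 1 marks a TE, 0 any other genomic element):

```python
# Minimal sketch: confusion-matrix counts and the metrics derived from them,
# on a hypothetical, highly imbalanced TE-detection example.
def confusion_counts(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fn, fp, tn

def metrics(tp, fn, fp, tn):
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1

# 2 true TEs among 100 genomic windows; a classifier that finds only one
y_true = [1, 1] + [0] * 98
y_pred = [1, 0] + [0] * 98
acc, prec, rec, f1 = metrics(*confusion_counts(y_true, y_pred))
```

With only 2 TEs among 100 windows, accuracy reaches 0.99 even though half of the TEs are missed (recall = 0.5), illustrating why accuracy alone is misleading on unbalanced data.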
Depending on the goal of the application and the characteristics of the elements to be classified, other metrics addressing the classification type (binary, multi-class, hierarchical), class balance (i.e., whether the training dataset is imbalanced or not), and the importance of positive or negative instances [36
] must be considered. Another point is the ability of a metric to preserve its value under a change in the confusion matrix, called measure invariance [40
]. These properties provide comparative parameters between metrics that are based not on the datasets but on the way the metrics are calculated. Each invariance property can be beneficial or unfavorable depending on the main objectives, the class balance, the dataset size, the data quality, and the composition of the negative class, among others [40
]. Thus, invariance properties are useful tools in order to select the most informative metrics in each ML problem.
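One such invariance can be checked numerically with a minimal sketch (the counts below are invented): holding tp, fn, and fp fixed while enlarging tn leaves precision and recall unchanged, but not accuracy.

```python
# Sketch of an invariance check on confusion-matrix counts: scaling only the
# true negatives (tn) simulates a much larger negative class.
def accuracy(tp, fn, fp, tn):
    return (tp + tn) / (tp + fn + fp + tn)

def precision(tp, fn, fp, tn):
    return tp / (tp + fp)

def recall(tp, fn, fp, tn):
    return tp / (tp + fn)

base = dict(tp=40, fn=10, fp=20, tn=30)
scaled = dict(base, tn=base["tn"] * 100)  # enlarge only the negative class

# precision and recall are invariant to the change in tn ...
assert precision(**base) == precision(**scaled)
assert recall(**base) == recall(**scaled)
# ... while accuracy is not: it rises from 0.70 to roughly 0.99
```

This kind of comparison, driven only by how each metric is calculated, is what makes invariance properties dataset-independent selection criteria.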
Recently, different ML-based software tools have been developed to detect repetitive sequences [34
], classify them (at the order or superfamily levels) [27
], or both [10
]. Additionally, deep neural network-based software has also been developed to classify TEs [11
]. Nevertheless, there are no studies on which metrics are most suitable given the unique characteristics of transposable element datasets and their dynamic structure. Here, we evaluated 26 metrics found in the literature for TE detection and classification, considering the main features of this type of data and the invariance properties and characteristics of each metric, in order to select the most appropriate ones for each type of classification.
The detection and classification of transposable elements is a crucial step in the annotation of sequenced genomes, because of their relation with genome evolution, gene function, regulation, and alteration of expression, among others [74
]. This step remains challenging given their abundance and diverse classes and orders. In addition, other characteristics of TEs, such as a relatively low selection pressure and a more rapid evolution than coding genes [26
], their dynamic evolution due to insertions of other TEs (nested insertion), illegitimate and unequal recombination, cellular gene capture, and inter-chromosomal and tandem duplications [76
], make them difficult targets for accurate and rapid detection and classification procedures. Indeed, TEs showing uniform structures and well-established mechanisms of transposition can be easily clustered and classified into major groups such as orders or superfamilies (e.g., LTR retrotransposons) [77
]. However, this task is relatively complex and time-consuming when classifying TEs into lower levels, such as lineages or families [78
]. For these reasons, TE classification and annotation are complex bioinformatics tasks [79
], in which, in some cases, manual curation of sequences by specialists is required. The ability of biologists to sequence any organism or group of organisms in a relatively short time and at relatively low cost redefines the bottleneck of genomics: the current limitation is not the generation of genome sequences but the amount of information to be processed in a limited time. Complex bioinformatics tasks may be accomplished by machine learning algorithms, as in drug discovery and other medical applications [80
], genomic research [38
], metagenomics [31
], and multiple applications in proteomics [83].
Previous works have applied ML and DL to TE analysis, such as Arango-López et al. (2017) [43
] for the classification of LTR-retrotransposons, Loureiro et al. (2012) [84
] for the detection and classification of TEs using developed bioinformatics tools, and Ashlock and Datta (2012) [69
] distinguishing between retroviral LTRs and SINEs (short interspersed nuclear elements). Deep neural networks (DNN) are also used to hierarchically classify TEs by applying fully connected DNN [11
] and through convolutional neural networks (CNN) and multi-class approaches [47].
In TE detection and classification, the dataset could be highly imbalanced [23
]; therefore, commonly used metrics such as accuracy and ROC curves may not be fully adequate [36
]. For the detection task, the positive class will be much smaller than the negative one, because the latter contains all other genomic elements. In classification, each type of TE (class, order, superfamily, lineage, or family) has different dynamics that produce a distinct number of copies. For example, in the coffee genus, LTR-retrotransposons show large copy number differences depending on the lineage [85
]. In Oryza australiensis
] and pineapple genomes [87
], a single LTR-retrotransposon family accounts for 26% and 15% (the Pusofa family) of the total genome size, respectively.
For binary classification (for example, to detect TEs or classify them into Class I and Class II), the most appropriate metric is F1-score (id = 7), which combines precision and recall values. Precision is a useful parameter when the number of false positives must be limited, and recall measures how many positive samples are captured by the positive predictions [36
]. However, the use of only one of these metrics cannot provide a full picture of the algorithm performance. Altogether, our results suggest that F1-score is appropriate for TE analyses.
In multi-class approaches (such as TE classification into orders, superfamilies, or lineages), F1-score (id = 20) also seems to be the most suitable metric, combined with the macro-averaging strategy, probably due to the high diversity of intra-class samples. For TE detection and classification, it appears more important to weigh all classes equally than to weigh each sample equally (micro-averaging strategy). Finally, for hierarchical classification approaches (i.e., considering the hierarchical classification of TEs proposed by Wicker and coworkers [8
]), F1-score↓ (id = 26) and F1-score↑ (id = 23) seem most suitable. These results demonstrate the importance of calculating the performance of each hierarchical level. Additionally, precision-recall curves and area under the precision-recall curve provided the best results for binary classification, demonstrating that, for TE datasets, they are more appropriate than the commonly used ROC curves.
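The effect of the macro-averaging strategy can be sketched with an invented three-class example in which class sizes mimic the copy-number imbalance among TE groups (the labels "LTR", "LINE", and "SINE" are used only for illustration):

```python
# Hedged sketch: macro- vs micro-averaged F1 on an imbalanced multi-class toy set.
def f1_per_class(y_true, y_pred, label):
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def macro_f1(y_true, y_pred):
    # every class weighs the same, regardless of its number of samples
    labels = sorted(set(y_true))
    return sum(f1_per_class(y_true, y_pred, lab) for lab in labels) / len(labels)

def micro_f1(y_true, y_pred):
    # for single-label multi-class data, micro-F1 reduces to accuracy
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# 90 abundant "LTR" samples classified perfectly, 8 "LINE" mostly correct,
# and 2 rare "SINE" samples all misclassified as "LTR"
y_true = ["LTR"] * 90 + ["LINE"] * 8 + ["SINE"] * 2
y_pred = ["LTR"] * 90 + ["LINE"] * 6 + ["LTR"] * 2 + ["LTR"] * 2
```

Here micro-F1 is 0.96 and looks excellent, while macro-F1 is about 0.61 because the rare class is completely missed, illustrating why weighting all classes equally is preferable for TE data.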
Area under the precision-recall curve, auPRC (id = 11), is a unique metric, showing invariance in I1 and non-invariance in I2. Its invariance properties make auPRC a robust measure of the overall performance of an algorithm, insensitive to the performance on any specific class (I1). However, it is less appropriate for data with a multi-modal negative class (~I2).
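For reference, auPRC can be approximated by sweeping a threshold over ranked prediction scores; the following simplified sketch assumes no tied scores and uses a right-step approximation (labels and scores are invented):

```python
# Rough sketch of auPRC: rank samples by score, sweep a threshold down the
# ranking, and accumulate precision over each increment in recall.
def pr_auc(y_true, scores):
    ranked = sorted(zip(scores, y_true), reverse=True)  # best-scored first
    total_pos = sum(y_true)
    tp = fp = 0
    auc, prev_recall = 0.0, 0.0
    for _, label in ranked:
        if label:
            tp += 1
        else:
            fp += 1
        recall = tp / total_pos
        precision = tp / (tp + fp)
        auc += (recall - prev_recall) * precision  # right-step approximation
        prev_recall = recall
    return auc

# A ranker that places both true TEs above all negatives scores a perfect 1.0
perfect = pr_auc([1, 1, 0, 0, 0], [0.9, 0.8, 0.3, 0.2, 0.1])
```

Because every point on the curve depends on precision, the baseline of auPRC tracks the positive-class prevalence, which is what makes it more informative than ROC curves on heavily imbalanced TE datasets.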
All metrics presented invariance in I3, indicating that they cannot measure changes in the true positives. This suggests that they can be used when the positive class is not very strong. PrecisionM (id = 18) and Precision↓ (id = 21) showed non-invariance in I4, which demonstrates that these metrics may be less reliable when manual labeling follows rigorous rules for the negative class. On the other hand, RecallM (id = 19) and Recall↓ (id = 22) exhibited non-invariance in I5, indicating that these metrics may not provide a conservative estimate when the positive class has outliers, as commonly found in TE datasets. Thus, these metrics might not be informative in TE detection and classification. The non-invariance of all metrics in I6, shown in Table 1
, demonstrated that these metrics can vary in data with large size differences. Consequently, these metrics must be used carefully when comparing results across different datasets.
The non-invariance in I7 shown by precision (id = 18 and 21) supports the combined use of this metric with others (such as in F1-score), as is common in ML algorithms. Finally, auPRC (id = 11), RecallM (id = 19), and Recall↓ (id = 22) may be better choices for evaluating classifiers when datasets of different sizes exhibit the same quality of positive (or negative) characteristics, as in the case of generated (simulated) data, due to their non-invariance properties in I8.
Our tests for the multi-class classification task of LTR retrotransposons at the lineage level show an overestimation of the performance of all ML algorithms used here (Figure 2
and Figure 3
) for both datasets (Repbase and PGSB). Furthermore, our experiments support the information found in the literature, indicating that accuracy is not the most informative metric for highly unbalanced datasets, such as those used in this study. Additionally, Figure 2
and Figure 3
indicate that this tendency of overestimation is generalized for nearly all the algorithms, pre-processing techniques, and coding schemes used here.
A clear exception, however, is shown by k-mers (in both training and validation datasets, Tables S4–S7
), for which accuracy and F1-scores did not show any differences. Nevertheless, if the F1-score is used in the tuning process (Figure 4
B,D, Figure 5
B,D, and Figure 6
B,D), accuracy also overestimates the performance of almost all the algorithms in comparison to F1-score, sensitivity (recall), and precision. Interestingly, RF performs in a similar manner to that of the other algorithms when PGSB (with more than 26,000 elements) is used, but DT presents the same behavior in both datasets.
When the performance of a given scheme is low, the overestimation shown by accuracy is more evident (Figure 5
and Figure 6
). This is due to the extremely low performance on some lineages; thus, accuracy is not very informative if it is not combined with another metric. As suggested by the literature and the invariance analyses, F1-score appears to be the most adequate and informative metric in the experiments performed here, since it is the harmonic mean of precision and sensitivity, jointly accounting for the false positives produced and the positive samples captured by the algorithm.
Overall, the results shown here can also be applied to data similar to TEs, such as retroviruses and endogenous retroviruses, or to data with highly imbalanced classes, high intra-class diversity, and multi-modal negative classes (in detection tasks).