Measuring Performance Metrics of Machine Learning Algorithms for Detecting and Classifying Transposable Elements

Abstract: Because of the promising results obtained by machine learning (ML) approaches in several fields, the use of ML to solve problems in bioinformatics is becoming more common every day. In genomics, a current issue is to detect and classify transposable elements (TEs) because of the tedious tasks involved in bioinformatics methods. Thus, ML was recently evaluated for TE datasets, demonstrating better results than bioinformatics applications. A crucial step for ML approaches is the selection of metrics that measure the realistic performance of algorithms. Each metric has specific characteristics and measures properties that may be different from the predicted results. Although the most commonly used way to compare measures is empirical analysis, a non-result-based methodology has been proposed, called measure invariance properties. These properties are calculated on the basis of whether a given measure changes its value under certain modifications in the confusion matrix, giving comparative parameters independent of the datasets. Measure invariance properties make metrics more or less informative, particularly on unbalanced, monomodal, or multimodal negative class datasets and for real or simulated datasets. Although several studies applied ML to detect and classify TEs, there are no works evaluating performance metrics in TE tasks. Here, we analyzed 26 different metrics utilized in binary, multiclass, and hierarchical classifications, through bibliographic sources, and their invariance properties. Then, we corroborated our findings utilizing freely available TE datasets and commonly used ML algorithms. Based on our analysis, the most suitable metrics for TE tasks must be stable, even when using highly unbalanced datasets, a multimodal negative class, and training datasets with errors or outliers.
Based on these parameters, we conclude that the F1-score and the area under the precision-recall curve are the most informative metrics since they are calculated based on other metrics, providing insight into the development of an


Introduction
Transposable elements (TEs) are genomic units able to move within and among the genomes of virtually all organisms [1]. They are the main contributors to genomic diversity and genome size variations [2], except for polyploidy events. Also, TEs perform key genomic functions involved in chromosome structuring, gene expression regulation and alteration, adaptation and evolution [3], and centromere composition in plants [4]. Currently, an important issue in genome sequence analyses is to rapidly identify and reliably annotate TEs. However, there are major obstacles and challenges in the analysis of these mobile elements [5], including their repetitive nature, structural polymorphism, species specificity, as well as high divergence rate, even across closely related species [6].
TEs are traditionally classified according to their replication mode [7]. Elements using an RNA molecule as an intermediate are called Class I or retrotransposons, while elements using a DNA intermediate are called Class II or transposons [8]. Each class of TEs is further sub-classified by a hierarchical system into orders, superfamilies, lineages, and families [9].
Several bioinformatic methods were developed to detect TEs in genome sequences, including homology-based, de novo, structure-based, and comparative genomic approaches, but no combination of them can provide a reliable detection in a relatively short time [10]. Most of the algorithms currently available use a homology-based approach [11], displaying performance issues when analyzing elements in large plant genomes. In the current scenario of large-scale sequencing initiatives, such as the Earth BioGenome Project [12], disruptive technologies and innovative algorithms will be necessary for genome analysis in general and, particularly, for the detection and classification of TEs that represent the main portion of these genomes [13].
In recent years, several databases consisting of thousands of TEs at all classification levels from several species and taxa have been created and published [3]. Furthermore, these databases have different characteristics, such as containing consensus [14][15][16] or genomic [17,18] TE sequences, coding domains [9,19], and also TE-related RNA [20,21]. These databases have been constructed with the TEs detected in sequenced species using bioinformatics approaches (commonly based on homology or structure), which can produce false positives if there is no curation process [11]. As in other biological datasets (such as datasets of splice sites [22] or protein function predictions [23]), databases contain different numbers of each type of TE, producing unbalanced classes [23]. For example, in PGSB, the largest proportion of the elements corresponds to retrotransposons (at least 86%) [24]. This imbalance is caused by the replication mode of each TE class. As in other detection tasks, the negative instances for identifying TEs are all genomic elements other than TEs (which constitute the positive instances) [25][26][27], such as introns, exons, CDS (coding sequences), and simple repeats, among others, making the negative class multimodal. These databases constitute valuable resources to improve tasks like TE detection and classification using bioinformatics or novel techniques such as machine learning (ML).
ML is defined as a set of algorithms that can be calibrated based on previously processed data or past experience [28] and a loss function through an optimization process [29] to build a model. ML is applied to different bioinformatics problems, including genomics [30], systems biology, evolution [28], and metagenomics [31], demonstrating substantial benefits in terms of precision and speed. Several recent studies using ML to detect TEs report drastic improvements in the results [32-34] compared to conventional bioinformatics algorithms [13].
In ML, the selection of adequate metrics that measure the algorithms' performance is one of the most crucial and challenging steps. Commonly used metrics for classification tasks are accuracy, precision, recall, and ROC curves [35,36], but they are not appropriate for all datasets [37], especially when the positive and negative datasets are unbalanced [13]. Accuracy and ROC curves can be meaningless performance measurements in unbalanced datasets [22], because they do not reveal the true classification performance of the rare classes [38]. For example, ROC curves are not commonly used in TE classification, because only a small portion of the genome contains certain TE superfamilies [34]. On the other hand, precision and recall can be more informative, since precision is the percentage of predictions that are correct [34] and recall is the percentage of true samples that are correctly detected [26]. Nevertheless, it is recommended to use them in combination with other metrics, since either of these metrics alone cannot provide a full picture of the algorithm performance [36].
Most of the classification and detection tasks addressed by ML define two classes, positive and negative [13]. Thus, results are labeled as true positives (tp) if they were classified as positive and belong to the positive class, and as false negatives (fn) if they were rejected but do not belong to the negative class. On the other hand, samples that belong to the negative class constitute false positives (fp) if they are predicted to be positive, or true negatives (tn) if they are not [13,28,39]. These counts are arranged in the confusion matrix, and most of the metrics used in ML are calculated from this matrix.
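The relationship between the confusion matrix and the common metrics can be illustrated with a minimal sketch. The function names (`confusion_counts`, `basic_metrics`) and the toy labels below are assumptions made for illustration only and are not taken from the study; the example mimics a detection task in which 5 of 100 genomic windows are TEs and a trivial classifier rejects everything.

```python
from typing import Dict, Sequence


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int],
                     positive: int = 1) -> Dict[str, int]:
    """Count tp, fn, fp, and tn for a binary task (hypothetical helper)."""
    counts = {"tp": 0, "fn": 0, "fp": 0, "tn": 0}
    for truth, pred in zip(y_true, y_pred):
        if truth == positive and pred == positive:
            counts["tp"] += 1
        elif truth == positive:
            counts["fn"] += 1      # positive sample that was rejected
        elif pred == positive:
            counts["fp"] += 1      # negative sample predicted as positive
        else:
            counts["tn"] += 1
    return counts


def _safe_div(num: float, den: float) -> float:
    """Return 0.0 instead of failing when a denominator is zero."""
    return num / den if den else 0.0


def basic_metrics(c: Dict[str, int]) -> Dict[str, float]:
    """Accuracy, precision, recall, and F1-score from confusion counts."""
    precision = _safe_div(c["tp"], c["tp"] + c["fp"])
    recall = _safe_div(c["tp"], c["tp"] + c["fn"])
    return {
        "accuracy": _safe_div(c["tp"] + c["tn"], sum(c.values())),
        "precision": precision,
        "recall": recall,
        "f1": _safe_div(2 * precision * recall, precision + recall),
    }


# Toy, made-up data: 5 positives (TEs) among 100 windows,
# and a classifier that predicts "negative" for every window.
y_true = [1] * 5 + [0] * 95
y_pred = [0] * 100

counts = confusion_counts(y_true, y_pred)
metrics = basic_metrics(counts)
print(counts)
print(metrics)
```

In this toy case the accuracy is high although no TE is detected, while recall and F1-score are zero, which is the behavior on unbalanced data described above.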
Depending on the goal of the application and the characteristics of the elements to be classified, other metrics addressing the type of classification (binary, multiclass, hierarchical), class balance (i.e., whether the training dataset is imbalanced or not), and the importance of positive or negative instances [36] must be considered. Another point is the ability of a metric to preserve its value under a change in the confusion matrix, called measure invariance [40]. These properties give comparative parameters between metrics that are not based on datasets, but on the way the metrics are calculated. Each of the invariance properties can be beneficial or unfavorable depending on the main objectives, the balance of the classes, the size of the datasets, the quality, and the composition of the negative class, among others [40]. Thus, invariance properties are useful tools for selecting the most informative metrics in each ML problem.
Recently, different ML-based software tools have been developed to tentatively detect repetitive sequences [34,41,42], classify them (at the order or superfamily levels) [27,[43][44][45], or both [10,46]. Additionally, deep neural network-based software was also developed to classify TEs [11,47]. Nevertheless, there are no studies about which metrics are more suitable, taking into account the unique characteristics of transposable element datasets and their dynamic structure. Here, we evaluated 26 metrics found in the literature for TE detection and classification, considering the main features of this type of data and the invariance properties and characteristics of each metric, in order to select the most appropriate ones for each type of classification.

Bibliography Analysis
As a literature information source, we used the results obtained by [13], who applied the systematic literature review (SLR) process proposed by [48]. The authors applied the search Equation (1) to perform a systematic review of research articles, book chapters, and other review papers indexed in well-known bibliographic databases such as Scopus, Science Direct, Web of Science, Springer Link, PubMed, and Nature.

("transposable element" OR retrotransposon OR transposon) AND ("machine learning" OR "deep learning")    (1)

Applying Equation (1), a total of 403 publications were identified, from which the authors removed those that did not satisfy certain conditions: repeated items (the same study found in different databases), other publication types (books, posters, short articles, letters, and abstracts), and items written in languages other than English. Then, the authors used inclusion and exclusion criteria in order to select articles of interest. Finally, 35 publications were selected as relevant in the fields of ML and TE [13]. Using these relevant publications, we identified the metrics used for the detection and classification of TEs, preserving information such as representation and observations (i.e., the properties measured). Next, we evaluated each metric that was reported as a decisive source in relevant publications. The characteristics and properties of each metric were analyzed regarding their application to TEs, considering that these elements have some particular characteristics, such as highly variant dynamics for each class, negative datasets with a large number of genomic elements for detection, a great divergence between elements of the same class, and species specificity.

Measure Invariance Analysis
Comparing the performance measures in ML approaches is not straightforward, and although the most common way to select the most informative measures is empirical analysis [49,50], an alternative methodology was proposed [40], which consists of assessing whether a given metric changes its value under certain modifications in the confusion matrix. This property is named measure invariance, and it can be used to compare performance metrics without focusing on their experimental results but using their measuring characteristics, such as detecting variations in the number of true positives (tp), false positives (fp), false negatives (fn), or true negatives (tn) presented in the confusion matrix [40]. Thus, a measure is invariant when its calculation function f, which receives a confusion matrix, produces the same value even if the confusion matrix has modifications. For example, consider the following confusion matrix $m = \begin{pmatrix} tp & fn \\ fp & tn \end{pmatrix} = \begin{pmatrix} 10 & 4 \\ 3 & 16 \end{pmatrix}$, that is, tp = 10, fn = 4, fp = 3, and tn = 16, and the function for calculating accuracy, $f = \frac{tp + tn}{tp + fp + fn + tn}$; thus, the accuracy for the confusion matrix presented above is $f(m) = 26/33 \approx 0.788$. Now consider exchanging the positive (tp by tn) and negative (fp by fn) values in the confusion matrix, obtaining $m' = \begin{pmatrix} 16 & 3 \\ 4 & 10 \end{pmatrix}$. If we apply the function f to the new confusion matrix, we obtain $f(m') = 26/33 \approx 0.788$. In this case, we can conclude that accuracy cannot detect exchanges of positive and negative values and thus it is invariant, since $f(m) = f(m')$.
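The accuracy example can be reproduced numerically with a short sketch. It uses the values of the example (tp = 10, fn = 4, fp = 3, tn = 16); the function and variable names are illustrative assumptions, and precision is included only as a contrasting measure that does change under the same exchange.

```python
def accuracy(tp: int, fn: int, fp: int, tn: int) -> float:
    """Accuracy computed from the four confusion-matrix counts."""
    return (tp + tn) / (tp + fn + fp + tn)


def precision(tp: int, fn: int, fp: int, tn: int) -> float:
    """Precision; fn and tn are accepted only to keep a uniform signature."""
    return tp / (tp + fp)


# Confusion matrix m from the example.
m = {"tp": 10, "fn": 4, "fp": 3, "tn": 16}

# Exchange positives and negatives: tp <-> tn and fp <-> fn.
m_swapped = {"tp": m["tn"], "fn": m["fp"], "fp": m["fn"], "tn": m["tp"]}

acc_m = accuracy(**m)
acc_swapped = accuracy(**m_swapped)
prec_m = precision(**m)
prec_swapped = precision(**m_swapped)

print(f"accuracy:  {acc_m:.3f} -> {acc_swapped:.3f}")    # unchanged
print(f"precision: {prec_m:.3f} -> {prec_swapped:.3f}")  # changes
```

Accuracy yields the same value (26/33) for both matrices, whereas precision moves from 10/13 to 16/20, so only accuracy is invariant under this exchange.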
In this work, we used eight invariance properties to compare the measures that were selected in the bibliographic analysis. All these invariances were derived from basic matrix operations, such as addition, scalar multiplication, and transposition of rows or columns, as follows [40]:

•
Exchange of positives and negatives (I1): A measure presents invariance in this property if $f\begin{pmatrix} tp & fn \\ fp & tn \end{pmatrix} = f\begin{pmatrix} tn & fp \\ fn & tp \end{pmatrix}$, showing invariance corresponding to the distribution of classification results due to its inability to differentiate tp from tn and fn from fp. A metric that is invariant in this property may not be suitable for highly unbalanced datasets [40], such as the number of TEs belonging to each lineage in the Repbase or PGSB databases.

•
Change of true negative counts (I2): A measure presents invariance in this property if $f\begin{pmatrix} tp & fn \\ fp & tn \end{pmatrix} = f\begin{pmatrix} tp & fn \\ fp & tn' \end{pmatrix}$ with $tn' \neq tn$, demonstrating the inability to recognize the specificity of the classifiers. This property can be useful in problems with a multi-modal negative class (the class with all elements other than the positive), e.g., in the detection of TEs, where the negative class may be composed of all other genomic features such as genes, CDS (coding sequences), and simple repeats, among others.

•
Change of true positive counts (I3): A measure presents invariance in this property if $f\begin{pmatrix} tp & fn \\ fp & tn \end{pmatrix} = f\begin{pmatrix} tp' & fn \\ fp & tn \end{pmatrix}$ with $tp' \neq tp$, losing the sensitivity of the classifiers, so its evaluation should be complemented with other metrics.

The properties described above were calculated by [40] for commonly used performance measures, and we used them to analyze the selected metrics (Table 1), except for the area under the precision-recall curve (auPRC), whose properties were calculated by us following the methodology proposed by the authors.
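A numerical check of properties I1 to I3 can be sketched as follows. This is only an illustration under stated assumptions: the base matrix is the one from the accuracy example, the offset of 25 added to tn or tp is arbitrary, and a single matrix gives evidence rather than a proof of invariance. All names (`TRANSFORMS`, `METRICS`, `invariance_table`) are hypothetical.

```python
import math
from typing import Callable, Dict, Tuple

Matrix = Tuple[int, int, int, int]  # (tp, fn, fp, tn)


def accuracy(tp, fn, fp, tn):
    return (tp + tn) / (tp + fn + fp + tn)


def precision(tp, fn, fp, tn):
    return tp / (tp + fp)


def recall(tp, fn, fp, tn):
    return tp / (tp + fn)


def f1_score(tp, fn, fp, tn):
    return 2 * tp / (2 * tp + fp + fn)


# Each transformation maps (tp, fn, fp, tn) to a modified matrix.
TRANSFORMS: Dict[str, Callable[..., Matrix]] = {
    "I1": lambda tp, fn, fp, tn: (tn, fp, fn, tp),       # exchange pos/neg
    "I2": lambda tp, fn, fp, tn: (tp, fn, fp, tn + 25),  # change tn (arbitrary offset)
    "I3": lambda tp, fn, fp, tn: (tp + 25, fn, fp, tn),  # change tp (arbitrary offset)
}

METRICS = {
    "accuracy": accuracy,
    "precision": precision,
    "recall": recall,
    "f1": f1_score,
}


def invariance_table(matrix: Matrix) -> Dict[str, Dict[str, bool]]:
    """Mark a metric as invariant when its value is unchanged by a transform."""
    table = {}
    for metric_name, metric in METRICS.items():
        base = metric(*matrix)
        table[metric_name] = {
            prop: math.isclose(base, metric(*transform(*matrix)))
            for prop, transform in TRANSFORMS.items()
        }
    return table


table = invariance_table((10, 4, 3, 16))
for name, row in table.items():
    print(name, row)
```

On this matrix, accuracy is unchanged only by I1, while precision, recall, and F1-score are unchanged only by I2, since none of them uses tn.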

Experimental Analysis
To test the behavior of the most commonly used metrics, such as accuracy, precision, and recall, and the best scoring metric found in this study, we performed several experiments addressing the specific problem of multi-class classification of LTR retrotransposons at the lineage level in plants. We selected this problem since LTR retrotransposons are the most common repeat sequences in almost all angiosperms and they represent an important fraction of their host genome; for instance, 75% in maize [51], 67% in wheat [52], 55% in Sorghum bicolor [53], and 42% in Robusta coffee [54]. As input, we used two well-known TE databases: Repbase (free version, 2017) [14] and PGSB [17]. For Repbase, we joined the LTR domains with the internal section (concatenating before and after) of each LTR retrotransposon found in the database. The first step was to generate a well-curated dataset of LTR retrotransposons; thus, we classified LTR retrotransposons from both databases at the lineage level using the homology-based Inpactor software [55] with RexDB nomenclature [9]. Inpactor has two filters for deleting nested elements: (1) removing elements with domains belonging to two different superfamilies (i.e., Copia and Gypsy) and (2) removing elements with domains belonging to two or more different lineages. Additionally, we applied three extra filters: (1) removing elements with lengths different from those reported by the Gypsy Database [19] with a tolerance of 20% (this value was chosen to filter elements with nested insertions of other TEs while keeping elements with natural divergence), (2) removing elements with fewer than two domains (incomplete elements derived from deletion processes), and (3) removing elements with insertions of partial or complete TEs from class II (present in Repbase). Finally, we removed elements from the following lineages: Alesia, Bryco, Lyco, Gymco, Osser, Tar, CHLAMYVIR, Retand, Phygy, and Selgy due to their very low frequency or absence in angiosperms.
Since the datasets used in this study are categorical (nucleotide sequences), we transformed them using the coding schemes shown in Table 2. Also, we used two additional techniques to automatically extract features from the sequences: (1) for each element, we obtained k-mer frequencies using k values between one and six (this range was selected because k-mers with k > 6 are rare in sequences, probably do not provide informative features, and are computationally expensive to calculate) and (2) we extracted three physical-chemical (PC) properties, namely average hydrogen bonding energy per base pair (bp), stacking energy (per bp), and solvation energy (per bp), which are calculated by taking the first di-nucleotide and then moving in a sliding window of one base at a time [56]. Since the ML algorithms used here require sequences of the same length, we found the largest TE in each dataset and completed the smaller sequences by replicating their nucleotides.
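The k-mer feature extraction and the length equalization can be sketched as below. This is not the authors' code: normalizing counts by the number of windows, skipping windows that contain characters other than A, C, G, or T, and padding by cyclic repetition of the sequence are all assumptions made for this illustration, and the function names are hypothetical.

```python
from itertools import product
from typing import Dict


def kmer_frequencies(sequence: str, k_min: int = 1, k_max: int = 6) -> Dict[str, float]:
    """Relative frequency of every possible k-mer for k in [k_min, k_max].

    Assumption: frequencies are counts divided by the number of windows of
    size k; windows with characters outside ACGT (e.g., N) are ignored.
    """
    sequence = sequence.upper()
    features: Dict[str, float] = {}
    for k in range(k_min, k_max + 1):
        counts = {"".join(p): 0 for p in product("ACGT", repeat=k)}
        windows = max(len(sequence) - k + 1, 0)
        for i in range(windows):
            kmer = sequence[i:i + k]
            if kmer in counts:
                counts[kmer] += 1
        for kmer, n in counts.items():
            features[kmer] = n / windows if windows else 0.0
    return features


def pad_by_replication(sequence: str, target_length: int) -> str:
    """Extend a sequence to target_length by repeating it cyclically
    (one possible reading of 'replicating their nucleotides')."""
    if not sequence:
        raise ValueError("cannot pad an empty sequence")
    repeats = -(-target_length // len(sequence))  # ceiling division
    return (sequence * repeats)[:target_length]


# Toy sequence, not a real TE.
demo = kmer_frequencies("ACGTAC", k_max=2)
full = kmer_frequencies("ACGTACGT")          # k = 1..6
padded = pad_by_replication("ACG", 8)

print(len(demo), demo["A"], demo["AC"])
print(len(full))
print(padded)
```

With k from one to six, each sequence is described by 4 + 16 + 64 + 256 + 1024 + 4096 = 5460 features, which is one reason larger k values become expensive.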

Table 2. Coding schemes for translating DNA characters into numerical representations. Adapted from [13].
The experiments consisted of executing all possible combinations of databases, coding schemes, pre-processing strategies, and ML algorithms (Figure 1 and Table 4). First, we used accuracy and the F1-score (with the macro-averaging strategy) as the main metrics in the tuning process (Table 3). Finally, we calculated other common metrics using the best value of the tuned parameter in each algorithm for comparison. All the experiments were performed using Python 3.6 and the Scikit-Learn library 0.22 [63], installed in an Anaconda environment on Linux over a CPU architecture. We ran our tests using the HPC clusters of IFB (https://www.france-bioinformatique.fr), IRD itrop (https://bioinfo.ird.fr/), and the Genotoul Bioinformatics platform (http://bioinfo.genotoul.fr/), all of which are managed by Slurm.

Bibliography and Invariance Analysis
Based on relevant literature sources (articles) detected in [13], by searching in several databases, we collected 26 metrics that are commonly used in different types of classification tasks, such as binary, multi-class, and hierarchical (Table 5). We were interested in classification metrics because the detection task can be considered as a binary classification (using TEs as the positive class and non-TEs as the negative class). Additionally, we assigned an importance level to each metric (Table 5) based on the following aspects: (i) How appropriate is its application to analyzing TE datasets (detection and classification)? (ii) Which features are measured and how important are these features for TE analysis? Each metric was assigned a level of importance (low, medium, high) for each aspect. Furthermore, the properties reported in relevant publications were used to evaluate each metric. In this way, we extracted and summarized information about each metric and we evaluated whether its use for TE datasets is plausible. General observations of metrics can be found in the observations column in Table S4.

Table 5 (excerpt). Fscore↑ | hierarchical | [11,23,24,27] | High | High
Table 5 notes: Although a and b are areas under the curve, they can be viewed as a linear transformation of the Youden Index [73]. Rows in bold were selected to perform invariance analyses. Additional information, such as metric representation and general observations of this table, is available in Tables S1-S3, and observations about levels of applicability and measured features can be found in Table S4.
We performed invariance analyses on the metrics with the best evaluation for each classification type (Table 1). Precision-recall curves were excluded from further analysis at this step, since it is impossible to draw a curve from a single confusion matrix. We obtained the invariance properties for almost all metrics from [40], except for the area under the precision-recall curve (ID = 11). For this metric, we generated a random confusion matrix and applied all the transformations presented in [40] in order to calculate its value and determine whether it changed or not. The invariance analyses followed the procedure described by [40].

Processes 2020, 8, 638

Experimental Analysis
To evaluate the relevance of the literature reports about metrics, we applied them to experiments on the multi-class classification of LTR retrotransposons in angiosperm plants at the lineage level. Using nucleotide sequences from Repbase [14] (free version, 2017) and PGSB [17] as input, we performed a classification process using Inpactor [55]. We generated high-quality datasets by removing sequences that did not satisfy certain filters (see Materials and Methods). After filtering and homology-based classification, we obtained 2,842 TEs from Repbase and 26,371 elements from PGSB (Table 6). We executed four experiments using the generated datasets (Table 4) to evaluate the behavior of each metric in different configurations. In the first two experiments, we were interested in analyzing the performance of the accuracy and F1-score metrics using a well-curated dataset (Repbase) that has only a few sequences in some lineages (Figures 2, S1, and S2). In the last two experiments, we evaluated a larger dataset (PGSB) and tested the same two metrics (Figures 3, S3, and S4). The complete results of all the experiments can be consulted in Tables S5-S8. Figures 2 and 3 show the best performance achieved by each algorithm after tuning one parameter (Table 3), using accuracy or F1-score as the main metric. Since each coding scheme displayed a different behavior, we were interested in further analyzing how each metric behaves with different algorithms and coding schemes. K-mers (Figure 4) showed the best performance, PC (Figure 5) displayed the worst performance, and complementary (Figure 6) showed an intermediate performance; these three schemes were selected for further analyses.

Discussion
The detection and classification of transposable elements is a crucial step in the annotation of sequenced genomes, because of their relation with genome evolution, gene function, regulation, and alteration of expression, among others [74,75]. This step remains challenging given their abundance and diverse classes and orders. In addition, other characteristics of TEs, such as a relatively low selection pressure and a more rapid evolution than coding genes [26], their dynamic evolution due to insertions of other TEs (nested insertion), illegitimate and unequal recombination, cellular gene capture, and inter-chromosomal and tandem duplications [76], make them difficult targets for accurate and rapid detection and classification procedures. Indeed, TEs showing uniform structures and well-established mechanisms of transposition can be easily clustered and classified into major groups such as orders or superfamilies (e.g., LTR retrotransposons) [77]. However, this task is relatively complex and time-consuming when classifying TEs into lower levels, such as lineages or families [78]. For these reasons, TE classification and annotation are complex bioinformatics tasks [79], in which, in some cases, manual curation of sequences by specialists is required. The ability of biologists to sequence any organism or a group of organisms in a relatively short time and at relatively low cost redefines the barrier of genomic information. The current limitation is not the generation of genome sequences but the amount of information to be processed in a limited time. Complex bioinformatics tasks may be accomplished by machine learning algorithms, as in drug discovery and other medical applications [80], genomic research [38,81], metagenomics [31,82], and multiple applications in proteomics [83].
Previous works apply ML and DL for TE analysis, such as Arango-López et al. (2017) [43] for the classification of LTR-retrotransposons, Loureiro et al. (2012) [84] for the detection and classification of TEs using developed bioinformatics tools, and Ashlock and Datta (2012) [69] for distinguishing between retroviral LTRs and SINEs (short interspersed nuclear elements). Deep neural networks (DNN) are also used to hierarchically classify TEs by applying fully connected DNN [11] and through convolutional neural networks (CNN) and multi-class approaches [47].
In TE detection and classification, the dataset could be highly imbalanced [23]; therefore, commonly used metrics such as accuracy and ROC curves may not be fully adequate [36]. For the detection task, the positive class will be much smaller than the negative class, because the latter contains all other genomic elements. In classification, each type of TE (classes, orders, superfamilies, lineages, or families) has different dynamics that produce a distinct number of copies. For example, in the coffee genus, LTR-retrotransposons show large copy number differences depending on the lineage [85]. In Oryza australiensis [86] and pineapple genomes [87], only one family of LTR-retrotransposons contributes to 26% and 15% (Pusofa) of the total genome size, respectively.
For binary classification (for example, to detect TEs or classify them into Class I and Class II), the most appropriate metric is F1-score (id = 7), which considers precision and recall values. Precision is a useful parameter when the number of false positives must be limited, and recall measures how many positive samples are captured by the positive predictions [36]. However, either of these metrics alone cannot provide a full picture of the algorithm performance. Altogether, our results suggest that F1-score is appropriate for TE analyses.
In multi-class approaches (such as TE classification into orders, superfamilies, or lineages), F1-score (id = 20) also seems to be the most suitable metric, combined with the macro-averaging strategy, probably due to the high diversity of intra-class samples. For TE detection and classification, it appears more important to weigh all classes equally (macro-averaging strategy) than to weigh each sample equally (micro-averaging strategy). Finally, for hierarchical classification approaches (i.e., considering the hierarchical classification of TEs proposed by Wicker and coworkers [8]), F1-score↓ (id = 26) and F1-score↑ (id = 23) seem most suitable. These results demonstrate the importance of calculating the performance at each hierarchical level. Additionally, precision-recall curves and the area under the precision-recall curve provided the best results for binary classification, demonstrating that, for TE datasets, they are more appropriate than the commonly used ROC curves.
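The difference between the macro- and micro-averaging strategies can be illustrated with a small sketch. The class labels (`lineage_A`, `lineage_B`, `lineage_C`), the class sizes, and the trivial majority-class predictions are invented for illustration and do not correspond to the datasets of this study; the helper names are hypothetical.

```python
from collections import Counter
from typing import Dict, Sequence


def per_class_counts(y_true: Sequence[str], y_pred: Sequence[str]) -> Dict[str, Counter]:
    """One-vs-rest tp, fp, and fn counts for every class label."""
    labels = sorted(set(y_true) | set(y_pred))
    counts = {label: Counter() for label in labels}
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            counts[truth]["tp"] += 1
        else:
            counts[truth]["fn"] += 1  # missed for the true class
            counts[pred]["fp"] += 1   # wrongly assigned to the predicted class
    return counts


def f1(tp: int, fp: int, fn: int) -> float:
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Average of per-class F1-scores: every class weighs the same."""
    counts = per_class_counts(y_true, y_pred)
    return sum(f1(c["tp"], c["fp"], c["fn"]) for c in counts.values()) / len(counts)


def micro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """F1-score from pooled counts: every sample weighs the same."""
    counts = per_class_counts(y_true, y_pred)
    tp = sum(c["tp"] for c in counts.values())
    fp = sum(c["fp"] for c in counts.values())
    fn = sum(c["fn"] for c in counts.values())
    return f1(tp, fp, fn)


# Made-up unbalanced dataset: one abundant and two rare classes,
# with a classifier that always predicts the abundant class.
y_true = ["lineage_A"] * 90 + ["lineage_B"] * 8 + ["lineage_C"] * 2
y_pred = ["lineage_A"] * 100

macro = macro_f1(y_true, y_pred)
micro = micro_f1(y_true, y_pred)
print(f"micro F1 = {micro:.3f}, macro F1 = {macro:.3f}")
```

Here the micro-averaged score stays high because the abundant class dominates the pooled counts, whereas the macro-averaged score drops because the two rare classes are never recovered.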
The area under the precision-recall curve, auPRC (id = 11), is a unique metric, which showed invariance in I1 and non-invariance in I2. Its invariance properties make auPRC a robust measure of the overall performance of an algorithm, and it is insensitive to the performance for a specific class (I1). However, it is less appropriate for data with a multi-modal negative class (~I2).
Example confusion matrices for the measure invariance analysis: $m = \begin{pmatrix} tp & fn \\ fp & tn \end{pmatrix} = \begin{pmatrix} 10 & 4 \\ 3 & 16 \end{pmatrix}$ (tp = 10, fn = 4, fp = 3, and tn = 16), with accuracy $f = \frac{tp + tn}{tp + fp + fn + tn}$ and $f(m) = 26/33 \approx 0.788$; exchanging the positive (tp by tn) and negative (fp by fn) values gives $m' = \begin{pmatrix} 16 & 3 \\ 4 & 10 \end{pmatrix}$.

Table 1 notes: The invariance properties of this metric (auPRC, ID = 11) were calculated by the authors in this study. I1: Exchange of positives and negatives, I2: Change of true negative counts, I3: Change of true positive counts, I4: Change of false negative counts, I5: Change of false positive counts, I6: Uniform change of positives and negatives, I7: Change of positive and negative columns, and I8: Change of positive and negative rows.

Figure 1. Overall flow of the experimental analysis done in this work.

Figure 2. Performance of machine learning (ML) algorithms and Repbase pre-processed data by principal component analysis (PCA) and scaling processes using as main metric: (A) accuracy and (B) F1-score.

Figure 3. Performance of ML algorithms and PGSB pre-processed data by PCA and scaling processes using as main metric: (A) accuracy and (B) F1-score.

Author Contributions: Conceptualization, S.O.-A., G.I., and R.G.; methodology, S.O.-A., J.S.P., and R.T.-S.; writing-original draft preparation, S.O.-A., R.T.-S., J.S.P., L.F.C.-O., G.I., and R.G.; writing-review and editing, S.O.-A., R.T.-S., J.S.P., L.F.C.-O., G.I., and R.G.; supervision, G.I. and R.G. All authors have read and agreed to the published version of the manuscript.

Funding: Simon Orozco-Arias is supported by a Ph.D. grant from the Ministry of Science, Technology and Innovation (Minciencias) of Colombia, Grant Call 785/2017. The authors and publication fees were supported by Universidad Autónoma de Manizales, Manizales, Colombia under project 589-089, and Romain Guyot was supported by the LMI BIO-INCA. The funders had no role in the study design, data collection and analysis, the decision to publish, or preparation of the manuscript.
If a metric is unchanged in this way, it will not show changes when additional datasets differ from the training datasets in quality (i.e., having more noise), indicating the need for other measures as a complement. On the contrary, if a metric presents non-invariant behavior, it may be suitable if different performances are expected across classes. In this case, if a metric is non-invariant, its applicability depends on the quality of the classes. It may be useful, for example, when curated datasets such as Repbase are available.

•
Change of false negative counts (I4): A measure presents invariance in this property if $f\begin{pmatrix} tp & fn \\ fp & tn \end{pmatrix} = f\begin{pmatrix} tp & fn' \\ fp & tn \end{pmatrix}$ with $fn' \neq fn$, providing reliable results even though some classes contain outliers, which is common in elements classified at the lineage level due to TE diversity in their nucleotide sequences [26].

•
Uniform change of positives and negatives (I6): A measure presents invariance in this property if $f\begin{pmatrix} tp & fn \\ fp & tn \end{pmatrix} = f\begin{pmatrix} k \cdot tp & k \cdot fn \\ k \cdot fp & k \cdot tn \end{pmatrix}$ with $k \neq 1$. It indicates whether a measure's value changes when the size of the dataset increases. Non-invariance indicates that the application of the metric depends on the size of the data.

Table 4. Description of experiments performed.

Table 6. Number of nucleotide sequences of each plant lineage used in Repbase and PGSB databases.