Harnessing the Power of Metabarcoding in the Ecological Interpretation of Plant-Pollinator DNA Data: Strategies and Consequences of Filtering Approaches

Abstract: Although DNA metabarcoding of pollen mixtures has been increasingly used in the field of pollination biology, methodological and interpretation issues arise due to its high sensitivity. Filtering or maintaining false positives, contaminants, and rare taxa or molecular features could lead to different ecological results. Here, we reviewed how this choice has been addressed in 43 studies featuring pollen DNA metabarcoding, which highlighted a very high heterogeneity of filtering methods. We assessed how these strategies shaped pollen assemblage composition, species richness, and interaction networks. To do so, we compared four processing methods: unfiltering, filtering with a proportional 1% of sample reads, a fixed threshold of 100 reads, and the ROC approach (Receiver Operator Characteristic). The results indicated that filtering impacted species composition and reduced species richness, with ROC emerging as a conservative approach. Moreover, in contrast to unfiltered networks, filtering decreased network Connectance and Entropy, and it increased Modularity and Connectivity, indicating that using cut-off thresholds better describes interactions. Overall, unfiltering might compromise reliable ecological interpretations, unless a study targets rare species. We discuss the suitability of each filtering type, plead for justifying filtering strategies on biological or methodological bases and for developing shared approaches to make future studies more comparable.


Introduction
The study of Plant-Pollinator interactions is pivotal to address both theoretical and applied issues, with important implications in evolutionary studies, conservation biology, and agrifood security, and it is relevant for providing reliable policies of land-use management and mitigation of anthropogenic stressors [1-4].
Traditionally, studies of Plant-Pollinator interactions have been carried out with direct field observations of animal foraging activity during flower visitation [5,6]. However, classifying the pollen grains carried on the pollinator's body is also a valuable approach to unveil Plant-Pollinator interactions [7,8]. This pollen might be accidentally picked up and carried by flower visitors when they touch plant reproductive structures. Alternatively, it can be actively collected and accumulated in specialized structures such as the scopa or the corbiculae in the case of some bee species. The identification of the transported pollen makes it possible to reconstruct the foraging "history" of flower visitors prior to a sampling event. In this way, it is possible to retrieve complete behavioural and ecological information on flower resource exploitation and to address ecological research questions in potentially fine detail. To achieve pollen identification, classical palynology based on morphology has traditionally been used. This approach can provide a lot of information about sample composition. However, it requires high expertise in pollen morphological assessment and, depending on the operator's expertise, can be time-consuming [9,10]. In addition, a detailed taxonomic resolution through morphological criteria could be limited by the lack of diagnostic characters among taxa [11]. The morphological approach could be useful when only a low number of pollen samples has to be identified or for gathering quantitative information on pollen amounts [12].
In the last decade, morphological difficulties have progressively been overcome by using DNA-based approaches that significantly reduce the time required for pollen identification [13,14]. Recent developments in DNA sequencing technologies, especially those based on High-Throughput Sequencing (HTS), made it possible to analyse the taxonomic composition of complex DNA matrices. For example, mixed pollen samples can be characterized by using standard DNA barcode regions in a so-called DNA metabarcoding approach [15,16]. For pollen-based studies, DNA metabarcoding is becoming a standard approach, being employed not only for the characterization of the pollen retrieved from animal bodies (see e.g., [17]), but also in the analysis of other kinds of samples such as cavity nests [18], honey [19], sediments [20,21], and forensics [10,22]. In the context of Plant-Pollinator interactions, the type of data retrieved from pollen DNA metabarcoding could potentially shed light on the foraging habits of flower visitors or evaluate the complexity and resilience of the interaction networks. This methodological revolution not only improved ecological knowledge, but it also offered new insights into the development of effective conservation and restoration actions [9].
Given the astounding number of sequences (hereafter "reads") [23,24], the data from HTS techniques require a proper bioinformatic pipeline. This is a critical phase of the dry lab activities and usually consists of (i) the assembly of paired-end reads resulting from bidirectional sequencing of the DNA templates, (ii) the analysis of the variation among sequences and the clustering of molecular features (e.g., Operational Taxonomic Units, OTUs, sensu [25] or Exact Sequence Variants, ESVs, sensu [26]), and (iii) the removal of chimeras, artifacts, and spurious sequences [27]. However, this bioinformatic process does not completely solve all the potential biases. Additional artifacts, hereafter referred to as false positives, result from clusters of molecular features (i.e., OTUs and ESVs) generated as a consequence of inaccuracies during field sampling operations (e.g., cross-contamination among samples), laboratory processing (e.g., contamination of DNA extraction or amplification reagents), or from some steps of the bioinformatic analysis (e.g., taxonomic misidentification of molecular features) [24,28]. The presence of infrequently detected molecular features or taxa might add further background noise in the output of a DNA metabarcoding pipeline. Given the extreme sensitivity of DNA metabarcoding, it is crucial to filter out false positives and contaminants, which could significantly alter the reconstruction of sample composition. Moreover, rare features or taxa should be treated consciously during the post-sequencing bioinformatic processing and possibly removed, depending on the study aims and the required sensitivity of the analysis [27-30]. However, the resulting species composition of a sample could be biased by the omission or misapplication of cut-off thresholds.
For instance, DNA metabarcoding could detect the occurrence of particularly infrequent taxa, which may be of interest in some specific cases (e.g., tracking the origin of a sample based on rare pollen). On the other hand, these taxa may have a great impact on the ecological interpretation of results, especially when read counts are converted into presence/absence data. Such a situation could lead to the overestimation of the generalist tendencies of the investigated pollinators and to misleading ecological interpretations.
Diversity 2021, 13, 437
The application of an appropriate cut-off threshold to filter the DNA metabarcoding data from the signal of possible false positives and rare taxa or features is therefore a critical step of the bioinformatic pipeline. Although some studies have not applied any cut-off threshold, different approaches for filtering false positives and rare taxa or features have been used so far in recent literature. In practice, some studies applied fixed cut-off thresholds, such as a defined number of reads used as reference level for accepting a molecular feature or taxon in a sample (e.g., [31]). Other studies employed proportional cut-off thresholds, where molecular features or taxa are discarded if represented by less than a certain percentage of the total reads of a sample (e.g., [32]). Alternatively, statistical approaches have been used for estimating a variable threshold based on Receiver Operator Characteristic (ROC) curves, thus depending on the distribution of reads among molecular features or taxa within a sample [17]. This highlights the absence of agreement on whether and how to prune a DNA metabarcoding output. However, to date, no studies have investigated the effect of each of the abovementioned filtering strategies on molecular datasets concerning pollen samples (or honey) and Plant-Pollinator interactions.
In this study, we investigated and summarized the criteria and the strategies adopted for filtering out the false positives and rare features or taxa, focusing on published studies on pollen DNA metabarcoding. Moreover, we aimed at evaluating the direct ecological effects of the most commonly applied filtering methods on publicly available datasets of pollen/honey DNA metabarcoding. To do this, we measured how unfiltering or different cut-off thresholds impacted (i) plant species composition and species richness and (ii) the interactions among plants and pollinators described by network indexes. With these aims, we evaluated how the different filtering strategies could alter the identification of species and of interactions, and thus the ecological interpretation of the results.

Materials and Methods
Filtering Taxa from Pollen DNA Metabarcoding: Literature Overview
To review the types of filtering and the methodology applied in the scientific literature to remove (or not) false positives and rare taxa or features, a bibliographical search was conducted in Scopus using the following keywords: "DNA" + "metabarcoding" + "pollen". Within the results of the query, we selected only peer-reviewed original published articles that dealt with pollination, pollinator diet (pollen and honey), and plant-pollinator interactions by using a DNA metabarcoding approach (we excluded reviews, news, views, opinions, perspectives papers, and studies on airborne pollen or other pollen matrices when unrelated to pollinators). We selected studies published between 2012, when the term DNA metabarcoding was proposed for the first time [16], and 2021 (last update on 9 May 2021). The retrieved articles were used to create a summarizing table (Table 1) including: (i) the type of sample from which the DNA was extracted, (ii) the studied organism, (iii) the details of the filtering applied, and (iv) the DNA barcoding markers used to achieve the amplification reaction.
Table 1. List of published studies subjected to review, including details on referencing, used samples in the DNA metabarcoding analysis, the organisms from which the pollen samples were collected, and the cut-off threshold with a brief explanation of the filtering actually applied. Additional information is given on the DNA barcode marker(s) used and on whether the dataset was included in the present study.

Evaluating the Consequences of Filtering (or Not) Taxa
To evaluate how the application (or not) of different cut-off thresholds could lead to changes in the results and their interpretation, we retrieved publicly available DNA metabarcoding datasets based on the ITS2 DNA barcode marker (the most widely used in pollen DNA metabarcoding studies) from the previously mentioned literature search. Only those datasets which were not preliminarily filtered were kept for our analysis (see Table 1). In detail, we retrieved published nonfiltered datasets (hereafter named "no cut", equivalent to a 0-reads threshold), and we derived several filtered versions by separately applying three different approaches for filtering false positives and rare taxa or features. The filtering approaches were chosen based on their frequency of use in the literature or on their biological reliability (i.e., the ROC approach). Specifically, the first method is based on a fixed threshold, and it removes from a sample the molecular features or taxa represented by fewer than 100 reads (hereafter "fixed 100 reads"), thus mimicking studies where exclusion thresholds are based on reads found in sequencing blanks (e.g., [12]). The second method is proportional and discards what is represented in a sample by a number of reads lower than 1% of the total sample count of reads (hereafter "proportional 1%"), as used for example in [37]. The third one estimates a cut-off threshold accounting for the distribution of reads among molecular features, thus providing a customized proportion for each sample through the statistical ROC curve approach, as indicated in [17] (hereafter "statistical ROC"). This strategy is commonly applied in several disciplines, and it was specifically proposed for the detection of false positives [70].
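The "no cut", "fixed 100 reads", and "proportional 1%" treatments described above can be illustrated with a minimal Python sketch. It is not the code used in this study: the function name `filter_sample`, its parameters, and the read counts of the example sample are hypothetical, and taxa sitting exactly at a threshold are assumed to be retained.

```python
def filter_sample(reads, method="no_cut", fixed_min=100, proportion=0.01):
    """Return the taxa of one sample that pass the chosen cut-off.

    reads: dict mapping taxon name -> read count for a single sample.
    Assumption: a taxon sitting exactly at the threshold is kept.
    """
    total = sum(reads.values())
    if method == "no_cut":
        threshold = 1  # 0-reads threshold: keep every detected taxon
    elif method == "fixed":
        threshold = fixed_min  # drop taxa with fewer than `fixed_min` reads
    elif method == "proportional":
        threshold = proportion * total  # drop taxa below 1% of sample reads
    else:
        raise ValueError(f"unknown method: {method}")
    return {taxon: n for taxon, n in reads.items() if n > 0 and n >= threshold}


# Hypothetical sample with 20,000 reads in total (1% = 200 reads).
sample = {"Trifolium": 18000, "Rubus": 1700, "Poa": 150,
          "Taraxacum": 120, "Salix": 30}

for method in ("no_cut", "fixed", "proportional"):
    print(method, sorted(filter_sample(sample, method)))
```

In this toy sample the fixed threshold removes only the taxon with 30 reads, whereas the proportional threshold also removes the two taxa with 120 and 150 reads, because 1% of the sample total corresponds to 200 reads.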
We applied the ROC approach in the same way as it was done in the pollen-based literature, thus following the procedure of [17]; see Supplementary Material Text S1 for a script, although different implementations of ROC are possible and they might affect the final estimations. We associated a variable coded as "negative" or "positive" to each taxon of a sample. Specifically, "negative" was assigned if its reads were 0; otherwise, "positive". We fitted a Generalized Linear Model (GLM) with an overdispersed Poisson distribution (quasi-Poisson) for each sample to model the distribution of the number of reads per taxon (quantitative response) between "positives"/"negatives" (categorical predictor). Fitting a regression is a necessary step for later estimating the false positives of a sample that otherwise are usually not known in DNA metabarcoding data. The predicted reads distribution was processed with the pROC package [71] (in the R environment), which uses the roc function to build ROC curves between the reads per sample estimated by the GLM and true "positives"/"negatives" (those used to fit the regression). The optimal threshold of reads below which taxa should be excluded was obtained with the function coords in the same package, based on Youden's J statistic [72] (see [71] for further details).
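The final step of this procedure, choosing the cut-off that maximizes Youden's J (sensitivity + specificity - 1), can be sketched in plain Python as follows. This is only an illustration of the statistic and not the script of Text S1: the function `youden_threshold`, the read counts, and the "true"/"false" labels (as could be known for a mock sample) are hypothetical, and candidate thresholds are taken directly from the observed values, which may differ from what pROC does internally.

```python
def youden_threshold(scores, labels):
    """Return (threshold, J) maximizing J = sensitivity + specificity - 1.

    scores: read counts (or model-predicted reads) per taxon.
    labels: True for taxa treated as real, False for taxa treated as noise.
    A taxon is called positive when its score is >= the threshold.
    Assumption: candidate thresholds are the observed score values.
    """
    n_pos = sum(1 for y in labels if y)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both positive and negative labels are required")
    best_t, best_j = None, float("-inf")
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y and s >= t)
        tn = sum(1 for s, y in zip(scores, labels) if not y and s < t)
        j = tp / n_pos + tn / n_neg - 1
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j


# Hypothetical sample: three genuine taxa and four low-read artifacts.
reads = [0, 0, 3, 5, 120, 400, 950]
truth = [False, False, False, False, True, True, True]
threshold, j = youden_threshold(reads, truth)
print(threshold, j)
```

With these toy values the genuine taxa and the artifacts are perfectly separated, so the selected threshold is the smallest read count among the genuine taxa.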
For each dataset, changes in plant species composition and species richness (standardized by the maximum number of species observed in a sample) of each pollen sample were evaluated in response to the type of filtering used (i.e., no cut, fixed 100 reads, proportional 1%, and statistical ROC). For the comparison of pollen species composition, we used a Permutational MANOVA based on distance matrices (with the Jaccard distance index), which is an analysis of variance that uses a permutation test with a pseudo-F ratio [73]. This analysis was performed through the adonis function of the R package vegan [74], where each dataset was analysed independently. The effect of the different cut-off thresholds on species richness was evaluated through a Generalized Linear Mixed Model (GLMM) with species richness as the response variable and the type of filtering used (i.e., no cut, fixed 100 reads, proportional 1%, and statistical ROC) as a covariate. The identity of the pollinator animal nested within the dataset was set as a random effect.
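The two sample-level quantities compared here can be sketched in Python as follows. This is a hedged illustration rather than the vegan workflow: the function names and taxon lists are hypothetical, the Jaccard distance is computed on presence/absence sets, and richness is assumed to be standardized by dividing by the largest richness among the samples considered.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two taxon lists: 1 - |A and B| / |A or B|."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # two empty samples are treated as identical
    return 1 - len(a & b) / len(union)


def standardized_richness(samples):
    """Richness of each sample divided by the maximum richness observed.

    Assumption: this is how 'standardized for the maximum number of
    species observed in a sample' is operationalized in this sketch.
    """
    richness = [len(set(s)) for s in samples]
    top = max(richness)
    return [r / top if top else 0.0 for r in richness]


# Hypothetical taxon lists of one sample before and after filtering.
unfiltered = ["Trifolium", "Rubus", "Poa", "Taraxacum", "Salix"]
filtered = ["Trifolium", "Rubus"]

print(jaccard_distance(unfiltered, filtered))       # 1 - 2/5
print(standardized_richness([unfiltered, filtered]))
```

A PERMANOVA then tests whether such pairwise distances are larger between filtering treatments than within them, which this sketch does not attempt.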
Network indices describing the interactions between plants and pollinators were calculated. Specifically, the analysed indices were Connectance (i.e., the proportion of possible links actually recorded), Modularity (i.e., a measure of how interactions are distributed into modules, where species within modules mostly interact with each other), and Shannon Entropy (i.e., a measure of the overall diversity and complexity in the interactions of a network). Furthermore, at the level of a single individual pollinator, the Connectivity index was calculated. This index of centrality quantifies the putative central role of an individual or of a species in connecting different parts of the whole network [75]. It could provide information for ranking individuals or species according to their contribution to the stability of the interactions and the cohesion among network participants. Network indices were calculated through the R packages bipartite (specifically for Connectance, Modularity, and Entropy) and rnetcarto (for Connectivity) [76,77]. For this purpose, only those datasets that originated from direct characterization of pollinator foraging were used, excluding a study on mock samples [24] and a study with an incomparable experimental design [38], which were instead used in the other analyses. Changes in interaction indices at the network level (Connectance, Modularity, Entropy) were evaluated through either a Linear or a Generalized Linear Mixed Model, depending on the distribution and range of the response variable. The type of filtering used was included in the models as a covariate and the dataset identity as a random effect. Individual-level (i.e., sample-level) Connectivity was analysed as the response variable, with the type of filtering used as a covariate in interaction with the normalized degree of the pollinator individuals. This normalized degree was calculated as the number of plant species found in each sample divided by the overall number of plant species.
The inclusion of the normalized degree in this analysis allowed us to describe the variation of Connectivity across the entire specialism-generalism spectrum of an individual and in relation to the applied filtering approach. In this case, the sample identity nested within the dataset was included in the model as a random effect.
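For the simpler indices defined above, a minimal Python sketch is given below. It is not a substitute for the bipartite package: the function names and the toy interaction matrix are hypothetical, and Shannon Entropy is assumed here to be computed over the relative weights of the realized interactions. Modularity and Connectivity require module detection and are not sketched.

```python
import math


def connectance(web):
    """Proportion of possible pollinator-plant links actually recorded."""
    links = sum(1 for row in web for w in row if w > 0)
    return links / (len(web) * len(web[0]))


def normalized_degree(row):
    """Plant taxa found in one sample divided by all plant taxa in the web."""
    return sum(1 for w in row if w > 0) / len(row)


def shannon_entropy(web):
    """Shannon entropy of interaction weights (assumed formula):
    H = -sum(p_ij * ln p_ij) over the realized interactions."""
    total = sum(w for row in web for w in row)
    return -sum((w / total) * math.log(w / total)
                for row in web for w in row if w > 0)


# Hypothetical presence/absence web: rows are pollinator individuals,
# columns are plant taxa.
web = [
    [1, 1, 0, 0],
    [0, 1, 1, 1],
    [1, 0, 0, 0],
]

print(connectance(web))             # 6 realized links out of 12 possible
print(normalized_degree(web[1]))    # 3 of 4 plant taxa
print(shannon_entropy(web))         # ln(6) for six equally weighted links
```

Removing a low-read taxon from a sample turns a 1 into a 0 in such a matrix, which lowers the number of realized links and the normalized degree of that individual.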
For all the regression analyses mentioned above, the filtering strategies were compared through a post hoc test (Tukey's HSD test). All the statistical analyses were carried out in R (version 3.6.1).

Results
Filtering Taxa from Pollen DNA Metabarcoding: Literature Overview
Overall, 43 research articles on pollen DNA metabarcoding were found and reviewed concerning the strategy of filtering of false positives and rare taxa or features (Table 1). About one quarter of studies did not apply any filtering approach, while the remaining ones applied at least one filtering type. Specifically, the proportional cut-off threshold was the most applied method (11 studies, 28%). In these studies, the cut-off threshold calculated as 1% of the number of reads produced by each sample was the most recurrent. Other filtering types were less common. Only one study used a statistical approach (i.e., the ROC curve; [17]) to set a proportional cut-off threshold. Nine studies (21%) used a fixed number of reads chosen arbitrarily as cut-off threshold (e.g., 100 or 1000 reads), while five other studies (12%) used the number of reads produced by negative controls to set the threshold to remove false positives and rare taxa or features. Finally, five studies (12%) used more than one filtering approach simultaneously. Details and a brief explanation of the strategies applied to set the cut-off threshold for the reviewed studies are reported in Table 1.
Twenty-eight published studies (65%) recovered the pollen samples from the whole insect's body or from specific body parts such as scopa and corbiculae. Four studies (9%) focused on pollen stored in cavity nests or in hives, while five (12%) investigated mixed pollen mock samples to address methodological issues (e.g., the optimization of DNA extraction or quantitative use of DNA metabarcoding reads). Finally, six studies (14%) analysed the taxonomic composition of honey by looking at the pollen grains contained in it.
Most of these studies (65%) used the ITS2 marker as a DNA barcode region for species identification, although in some of these cases (28%), this marker was combined with other barcode loci (e.g., rbcL).

Evaluating the Consequences of Filtering (or Not) Taxa
From the 43 reviewed studies, eight nonfiltered and publicly available ITS2 DNA metabarcoding datasets were retrieved. Among these, four were obtained by processing the pollen found in nests or carried on insects' bodies [17,35,67,69]. Three datasets contained data from honey samples [38,45,50], and one was obtained from the analysis of pollen mock samples [24].
Significant changes in the composition of pollen samples depending on the filtering approach are summarized in Table 2. Specifically, the main differences occurred between the no-cut and all the other filtering approaches, in all datasets (Table 2). Minor changes in community composition among fixed 100 reads, proportional 1%, and statistical ROC were only occasionally found (Table 2).
Table 2. Comparison of cut-off thresholds applied on pollen species composition of samples from several datasets, based on Permutational Manova. Dataset names (entitled with main author and year; see Table 1 for further details) are reported in the first column "Dataset". The column "F-value" reports the pseudo-F ratio value and the associated significance p (α = 0.05). Significant cases are reported in bold.
Plant species richness inferred from pollen samples was significantly influenced by the filtering approach ($\chi^2_3 = 468.22$, p < 0.001). Specifically, higher species richness per sample was found in the unfiltered type (i.e., no cut) compared to all the other filtering approaches. A significant difference between the proportional 1% and the statistical ROC approaches was also found, with the latter reducing species richness even more (Figure 1a, Table 3).
The individual-level index of Connectivity showed a significant effect of the interaction between the filtering approach and the normalized degree index ($\chi^2_3 = 609.2$, p < 0.001). Specifically, the Connectivity was lower in the unfiltered (no cut) compared to all the other filtering types for any value of the normalized degree (i.e., for both generalist and specialist individual pollinators; Figure 2, Table 4).
with normalized degree of individual pollinator (Tukey's multiple comparison test, α = 0.05). Significant cases are highlighted in bold.

Discussion
Since its "formalization" in 2012, the DNA metabarcoding approach has revolutionized the field of biodiversity investigation, and it has even provided insights for studying biological interactions. Its application has rapidly spread and has contributed to research contexts such as microbiome [78,79], food [80-82], trophic ecology [83,84], and environmental DNA-based analyses [85]. In spite of its usefulness, methodological choices during the whole DNA metabarcoding pipeline, and specifically the bioinformatic processing, could deeply influence the obtained results and their interpretation [86-88]. Therefore, in this study, we attempted to evaluate the effects of the approach used to filter DNA metabarcoding outputs. Specifically, we focused on the analysis of DNA metabarcoding of pollen in the framework of plant-pollinator interactions, while being aware that the outcomes of our investigation could be extended to other types of DNA metabarcoding-based studies. Although the issue of removing false positives and rare taxa or features is quite neglected in the scientific literature in relation to the bioinformatic pipeline (but see [28,30]), the choices made when analysing an HTS output could generate relevant effects on the community composition, species richness, and species interactions. These aspects would deeply impact the ecological outcomes of the investigated system.
The high potential of using DNA metabarcoding outputs for pollen analysis would also merit a robust bioinformatic pipeline that should be coherent and comparable among different studies. Instead, our literature overview highlighted a high heterogeneity of filtering approaches adopted to remove false positives and infrequent taxa. This is evident even among studies that focused on similar analytical matrices (e.g., pollen from animal bodies). Surprisingly, a similarly high heterogeneity in filtering approaches emerges from studies using morphological palynology ([7] used a minimum number of 10 pollen grains, while [89,90] used a threshold of 5, [91,92] retained species with a frequency of pollen grains above 10%, and [93], above 1%). In the case of DNA metabarcoding of pollen, our literature review showed that the proportional approach is the most recurrent, that is, removing those molecular features or species present in the sample with reads under a certain proportion of the total reads per sample. This is quite expected, as it is an approach also well represented in other studies using DNA metabarcoding (e.g., [83,94,95]). This is probably due to the ease of calculating proportions, which are advantageous when comparing different samples or when samples would be too depauperate after applying a fixed number of reads as threshold. However, we found no concordance between different authors about the exact proportion to be used as threshold nor, surprisingly, about the reasons justifying the choice of a particular percentage or another one (e.g., [61] used 0.1%, while [59] used 0.01%; see Table 1). It should be noted that Peel and colleagues [53], while analysing samples of pollen prepared ad hoc with a known composition, highlighted that false positives occurred at a rate lower than 1% per sample, thus supporting this filtering strategy.
On the other hand, caution is recommended before generalizing the 1% threshold as a universally effective filtering practice; for samples with an extremely high total read count, it might be better to use a lower value. Conversely, a 1% threshold can also be ineffective with almost empty samples, such as in the case of a fly that has never visited a flower but that was contaminated by airborne pollen. In such cases, it might be worth using even higher threshold values to better safeguard against misleading information.
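The dependence of a proportional threshold on sequencing depth can be made concrete with a short Python sketch. The helper `passes_proportional` and all read counts are hypothetical and serve only to illustrate the two situations described above.

```python
def passes_proportional(taxon_reads, total_reads, proportion=0.01):
    """True if a taxon reaches the given proportion of the sample's reads."""
    return taxon_reads >= proportion * total_reads


# Deeply sequenced sample: 5,000 reads are still only 0.5% of 1,000,000,
# so a well-supported taxon is discarded by the 1% threshold.
deep = passes_proportional(5_000, 1_000_000)

# Almost empty sample: 3 reads of stray airborne pollen are 7.5% of 40,
# so a likely contaminant is retained by the same threshold.
sparse = passes_proportional(3, 40)

print(deep, sparse)
```

The same percentage therefore translates into very different absolute read counts, which is why a single proportion may be too strict for deep samples and too permissive for nearly empty ones.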
The second most recurrent cut-off approach found in the literature is based on a fixed number of reads, used as a general threshold across all samples (e.g., 50 as in [63], 100 as in [68]; see Table 1), the most frequent value being 100 reads per sample. As noted above for proportional thresholds, the specific value of the cut-off threshold is rarely supported by clear biological reasons. The subjectivity of authors is an important factor, and it could be a source of biases, as, for instance, studies using high threshold values would likely remove a high proportion of truly occurring taxa. For example, [56,57] observed that a threshold of 1000 reads per plant species ensures the removal of the vast majority of grass pollen species, which, however, were taxa occurring in the study areas and should be considered true positives from airborne pollen. Therefore, low or high cutting values could have been chosen depending on the need to remove false positives but also on potential environmental contamination or infrequent species. Conversely, in other studies, the fixed cut-off value is clearly derived from sequenced negative controls. In those cases, the maximum number of reads found in blank samples is usually set as threshold (see Table 1).
The assumption behind this approach is that it would allow removal of false positives exclusively originating from laboratory activities (i.e., during DNA isolation, PCR, and sequencing) [24]. However, the impact of using blanks to yield thresholds is not so clear when it comes to rare or infrequent species that might have fewer sequencing reads than controls, and those cases would be systematically removed by this approach. Regarding this, the development of practices to retrieve negative controls for field contaminations (as hypothesized in [24]) could probably further improve the potential of this filtering method, allowing for better discrimination between species originated from field contamination and rare but truly occurring taxa.
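A threshold derived from negative controls can be sketched in Python as follows. The function names, taxa, and read counts are hypothetical; the sketch assumes that the threshold is the largest read count of any taxon in any blank and that only taxa strictly above it are retained, whereas individual studies may define the rule differently.

```python
def blank_threshold(blanks):
    """Largest read count found for any taxon in any negative control."""
    return max((n for blank in blanks for n in blank.values()), default=0)


def filter_with_blanks(reads, blanks):
    """Keep taxa whose reads exceed the maximum observed in the blanks.

    Assumption: taxa with exactly the blank maximum are discarded.
    """
    threshold = blank_threshold(blanks)
    return {taxon: n for taxon, n in reads.items() if n > threshold}


# Hypothetical blanks (extraction and PCR controls) and one field sample.
blanks = [{"Poa": 12, "Rubus": 3}, {"Poa": 40}]
sample = {"Trifolium": 9000, "Poa": 35, "Orchis": 25}

print(blank_threshold(blanks))             # 40
print(filter_with_blanks(sample, blanks))  # only Trifolium is kept
```

In this toy case the rare taxon with 25 reads is removed together with the likely contaminant, which illustrates why blank-based thresholds cannot by themselves separate laboratory contamination from rare but genuinely occurring species.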
Unfiltering seems controversial. The literature survey (Table 1) showed that nearly a quarter of studies did not report a filtering approach (based on read counts) and possibly did not filter the datasets with quantitative thresholds. However, in some of these cases, manual filtering was used to remove the species that were not plausible in the study area [48,49,57,67], while in others, the concordance between multiple DNA barcoding markers was employed [58], thus at least partly following the recommendations to remove false positives and possibly rare taxa or features [96]. The analyses of our study clearly suggest that using a cut-off threshold for filtering the HTS output leads to significant differences compared to the unfiltered output matrix, especially in species composition, species richness, and plant-pollinator interactions, impacting the ecological interpretation of the data. Our results indicated that, firstly, any of the cut-off thresholds yielded a community composition different from that obtained through unfiltered data. Moreover, filtering decreased species richness in comparison to nonfiltered data. These differences between unfiltered and filtered data could even be amplified under particular research aims. For example, in studies focusing on pollinator foraging behaviours, unfiltering could overestimate the number of plants foraged by an animal, and it could lead to an overestimation of generalism, foraging niche, and delivered ecosystem service of pollination. Another example derives from studies on honey composition, where a no-cut strategy could be misleading about the purity of products, with consequences that could involve commercial issues. In our investigation, the filtering of false positives or rare taxa impacted not only the species composition and richness but also the ecological networks of species interactions.
Specifically, we detected significant differences when comparing networks calculated from filtered and nonfiltered data. The implications of this network variation could potentially be very high, as, for instance, network Entropy, Connectance, Modularity, and Connectivity refer to network stability and resilience, to the ability to buffer perturbations, and to the stabilizing role of central hub species [40,97-99]. Thus, the higher the difference between filtering and unfiltering strategies, the higher the potential for misleading ecological results obtained from the networks associated with each filtering type. For instance, our results showed that filtering decreased network Connectance and Entropy. This aligns well with the lower species richness per sample found in filtered datasets, and it can be explained by a decrease in the number of realized links in the network (i.e., fewer plant species found on pollinator bodies or samples). In other words, by decreasing the number of links, filtering likely yields networks with slightly higher element-specific linkage compared to nonfiltered networks. From an ecological point of view, this translates into a lower chance of overestimating generalism after filtering. Moreover, filtering increases Modularity and Connectivity of networks. This result further clarifies that removing ambiguous taxa decreases the ubiquity of links among elements, thus allowing for better emergence of ordered patterns of well-defined compartments of interactions (Modularity) and important hub species connecting them (characterized by a higher Connectivity). As a consequence, filters seem to increase the ecological reliability when describing how flower resources are used by foragers (i.e., Modularity) and the importance of certain species in contributing to the stability of interactions (i.e., Connectivity).
In other words, unfiltering returns networks richer in links, which tend to be ubiquitously distributed among elements, with a high potential for overestimating foraging strategies and network resilience. Based on these considerations, researchers could prefer filtering their data. One exception to this would be when the role of rare plant species is targeted in the study [100]. In this case, manually checking an unfiltered dataset for implausible taxa could limit the amount of false negatives. Moreover, integrating DNA metabarcoding data with traditional (quantitative) morphology of pollen could improve the reliability and interpretation of results [101]. This could possibly control for the presence of spurious information from DNA metabarcoding.
Among the filtering strategies analysed here, the statistical ROC approach appears to be the most conservative one, since it tends to yield the lowest species richness, the highest Modularity, and the lowest Connectance and Entropy. Thus, ROC-based filtering might remove not only the false positives from samples but also the infrequent species. It should be noted that ecological patterns emerging or confirmed even in a conservative framework are more likely to be trustworthy. Even if this approach has rarely been applied, to date, in the pollen DNA metabarcoding literature (Table 1), it was specifically developed to distinguish "true signals" from "noise" [102] in molecular biology and could constitute a promising avenue for processing data of pollen DNA metabarcoding. For instance, it has been used in other DNA-based research fields [103,104], such as eDNA, where it has been shown to increase the reliability of data [105]. However, it would be worthwhile to investigate how different parametrizations of the ROC approach would impact the estimations of cutting thresholds. Because ROC is a conservative approach, it may be favoured in studies aiming to highlight ecologically meaningful species composition, richness, and interactions, while sacrificing the pursuit of high species richness based on keeping elements of rarity, potential contaminants, and false positives.

Conclusions
Our survey shed light on the possible consequences of using (un)filtering strategies for pollen DNA metabarcoding data in ecological and biological research. To date, this powerful molecular tool still requires the development of shared approaches to the bioinformatic filtering of molecular features. This would improve the provision of reliable, repeatable, and comparable data. In particular, we recommend that researchers (i) always make both raw unfiltered and filtered data easily accessible, thus improving the possibility of exploring large amounts of data and, consequently, the growth of knowledge in strategic research fields such as pollination ecology. Authors should (ii) filter out false positives and possibly also infrequent species, depending on research aims. Moreover, (iii) the specific type of filtering must be clearly justified from a biological perspective, evaluating the efficiency and universality of the loci selected for species identification and the consequent taxonomic resolution of molecular feature assignments. Furthermore, (iv) the specific strategy has to be decided based on whether the research aim would benefit from a conservative filtering. Without an appropriate filtering, DNA metabarcoding reads converted to presence/absence data certainly yield spurious results [30,106,107]. To avoid this, conservative approaches like the ROC filtering should preferably be adopted. Otherwise, the application of a filter either based on a percentage with a clear biological support or based on a fixed value from negative controls is possible, although greater awareness should be placed on the risk of excluding only false positives while keeping environmental contamination and infrequent species.
Supplementary Materials: The following are available online at https://www.mdpi.com/article/10.3390/d13090437/s1, Text S1: Script for estimating filtering thresholds with an approach based on ROC curves.