Compilation, Revision, and Annotation of DNA Barcodes of Marine Invertebrate Non-Indigenous Species (NIS) Occurring in European Coastal Regions

The introduction of non-indigenous species (NIS) is one of the major threats to the integrity of European coastal ecosystems. DNA-based assessments have been increasingly adopted for monitoring NIS. However, the accuracy of DNA-based taxonomic assignments is largely dependent on the completion and reliability of DNA barcode reference libraries. As such, we aimed to compile and audit a DNA barcode reference library for marine invertebrate NIS occurring in Europe. To do so, we compiled a list of NIS using three databases: the European Alien Species Information Network (EASIN), the Information System on Aquatic Non-indigenous and Cryptogenic Species (AquaNIS), and the World Register of Introduced Marine Species (WRiMS). For each species, we retrieved the available cytochrome c oxidase subunit I (COI) mitochondrial gene sequences from the Barcode of Life Data System (BOLD) and used the Barcode, Audit & Grade System (BAGS) to check congruence between morphospecies names and Barcode Index Numbers (BINs). From the 1249 species compiled, approximately 42% had records on BOLD, among which 56% were discordant. We further analyzed these cases to determine the causes of the discordances and attributed additional annotation tags. Of the 622 discordant BINs, after revision, 35% were successfully solved, which increased the number of NIS detected in metabarcoding datasets from 12 to 16. However, a fair number of BINs remained discordant. Reliability of reference barcode records is particularly critical in the case of NIS, where erroneous identification may trigger action or inaction when not required.


Introduction
Coastal ecosystems are the source of remarkable biological productivity and an abundance of goods and services [1]. However, these ecosystems and their native biodiversity have been significantly damaged by human activities, global climate change, and biological invasions [2][3][4]. Non-indigenous species (NIS) can be introduced outside their typical distribution range naturally or by human mediation. If they survive, establish, and expand in the recipient ecosystems, they may become invasive, spreading rapidly, and displacing and out-competing native species, threatening the ecosystem's integrity [5,6]. These introductions, which are becoming more frequent due to climate change and increasing globalization, can cause negative effects on biodiversity, human health, and welfare, as well as major economic losses [7]. In this regard, NIS introductions are now included in key legislations and directives such as the European Union Regulation 1143/2014 [8] on Invasive Alien Species and the Marine Strategy Framework Directive [9]. For these reasons, it is critical to monitor non-indigenous species for better management and protection of marine environments [10]. Most past and ongoing monitoring programs are based exclusively on morphological approaches for species identification [11][12][13]. However, this is an expertise-demanding and time-consuming procedure, also hindered by the decrease in the number of taxonomists worldwide [14,15]. With the improvement of sequencing techniques, monitoring programs have started to implement DNA-based approaches, such as DNA barcoding and metabarcoding. While DNA barcoding consists of the identification of species using standardized short DNA fragments [16], DNA metabarcoding combines it with high-throughput sequencing (HTS), allowing the identification of multiple species and taxonomic groups from complex samples, including environmental samples [17,18]. 
Several advantages of DNA-based approaches over the traditional methods (i.e., morphology-based identification) include the increased sensitivity and specificity in the detection of species, including cryptic taxa, as well as a greater time and cost effectiveness through the simultaneous processing of a large number of samples, ultimately improving NIS detection and monitoring in coastal and marine ecosystems [19][20][21][22][23]. The use of DNA-based tools also allows the detection of early developmental stages (e.g., gametes, propagules, planktonic larvae, eggs) or smaller organisms not amenable to morphological identification or when organisms occur at low densities [24]. NIS early detection is crucial to overcome the irreversible impacts that these species, once established, can provoke in coastal and marine ecosystems.
DNA-based species identifications are highly dependent on the quantity and quality of molecular data present in the available genetic databases [25]. The main public databases used are GenBank® (www.ncbi.nlm.nih.gov/genbank/, accessed on 25 January 2023) and the Barcode of Life Data System (BOLD) [26,27]. While GenBank possesses a much larger DNA sequence dataset, BOLD was created specifically for the acquisition, storage, analysis, and publication of DNA barcode records. The standard DNA barcode region for the animal kingdom is a ~650 bp fragment of the 5′ end of the cytochrome c oxidase I (COI) mitochondrial gene, due to its ability to discriminate species in most animal taxa, including congeneric species [28]. Despite the increase in DNA barcode data availability over the years, there are still gaps in representative barcodes for a large proportion of European marine invertebrate species (e.g., [29][30][31]), reaching as much as 50 to 70% for dominant macroinvertebrate taxa [29,32], and up to 63% for NIS belonging to Animalia [33].
The Barcode Index Number (BIN) system was created in BOLD to assign a specific code (BIN) to a cluster of COI nucleotide sequences that represent Molecular Operational Taxonomic Units (MOTUs), delimited using the Refined Single Linkage (RESL) algorithm [34]. For most animal taxa, COI MOTUs match with species, hence BINs can be viewed as a proxy to species and be used as a molecular benchmark to test the taxonomic congruence of COI barcode records [34]. For a sequence to be barcode compliant, there is a set of formal criteria implemented in BOLD [27] to assess the quality of the molecular data and each record's metadata (see also [35]). However, a number of operational errors that may occur along the production chain of the reference DNA barcodes (e.g., misidentification of specimens, cross-contamination of PCR's DNA template, or mislabeling of records) cannot be addressed by such quality control criteria, thereby undermining the reliability of the taxonomic assignments of barcode records [32].
Indeed, a comprehensive revision of DNA barcode records of 7576 species distributed across four phyla of marine invertebrates revealed that, overall, 39% of the species had ambiguous assignments [25]. As most DNA-based monitoring studies rely on poorly curated libraries to assign a taxonomic identification to sequences, quality control measures should be employed to ensure the taxonomic reliability of the data used, particularly in the case of NIS, where identifications of possible invasive species can trigger official government actions to prevent negative ecological and economic impacts or to manage a current invasion situation [32,36,37]. Datasets and databases have been developed with diverse curation levels and criteria depth and targeting different taxonomic groups. However, most of the quality control criteria employed focused more on the removal of sequences according to thresholds of minimum length and sequence quality and reliability parameters rather than on the accuracy of the taxonomic assignment [38], while others only flag sequences that seem discordant or with conflicting taxonomic classification [39][40][41]. There are also tools and software that examine available DNA sequences and evaluate their quality or possible errors that can lead to misidentifications [42]. Oliveira and co-authors [43] described a manual process to audit a reference library for European marine fishes by using a grading system to determine concordance or discordance between morphospecies and BINs, which was then used as a starting point to create the Barcode Audit & Grade System (BAGS), an automated tool for grading and auditing animal DNA barcode reference libraries [44]. However, most of these methodologies only detect and flag discordances but do not disentangle them.
Given the fast increase in the use of DNA-based assessments in ecological studies and in the resulting molecular data, there is an urgent need to create curated DNA barcode reference libraries to improve the accuracy of DNA-based identifications [30,31,45,46].
With this work, we aimed: (i) to assess and analyze the COI barcode data available in the BOLD database, for a compiled list of non-indigenous marine invertebrate species for Europe, (ii) to further examine discordant data, i.e., when discordances between morphospecies and BINs exist, (iii) to provide a workflow to curate discordances (i.e., amending the solved discordances and signaling the unsolved ones), (iv) to create a curated DNA barcode reference library of European marine invertebrate NIS and, finally, (v) to assess the efficiency of the curated DNA barcode reference library in NIS detection, using DNA metabarcoding datasets obtained in recreational marinas in Portuguese coastal regions, known in advance to contain NIS.

Compilation of the Genetic Data and Curation
To compile and audit the available COI barcode data for each species in the list, we used the software BAGS ([44]; https://github.com/tadeu95/BAGS (accessed on 20th September 2021)). This tool annotates species records in five grades (A-E) according to the congruency between species assignments and BINs, and the quantity of the sequences present in the BOLD database [27]: A and B for concordant morphospecies (one BIN = only one species name), where these grades only differ in the quantity of available sequences in the library (grade A: >10 sequences and grade B: ≤10 sequences), C for morphospecies assigned to multiple BINs, D for morphospecies with insufficient sequences (<3) and E for seemingly discordant morphospecies (more than one species name in the same BIN). With the graded library generated by BAGS, a workflow was designed (Figure 1) to classify the BINs and further audit graded C and E records and curate the seemingly discordant morphospecies (grade E). A system of four tags was used (adapted from [25]): AMBIG (to refer to ambiguous records), MISID (to refer to misidentifications), SYN (to refer to synonyms) or SHARE (if well-established species were aggregated in the same BIN). The final tags were set out as RELIABLE, if we successfully matched a morphospecies to a single BIN, either in case of concordance or solved BIN discordances; or UNCERTAIN, if the BIN discordance remained unsolved after auditing, or if the available sequences were insufficient to resolve the taxonomic ambiguity (criteria detailed further below).
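The five grades described above can be illustrated with a minimal Python sketch. This is not the BAGS implementation (BAGS is a separate tool, available at the repository cited above); the function name `grade_species`, the input format (one `(species, BIN)` pair per sequence), and the order in which the rules are checked (E, then D, then C, then A/B) are assumptions made only for illustration.

```python
from collections import defaultdict

def grade_species(records):
    """Grade each morphospecies from (species_name, bin_id) pairs, one pair per sequence.

    Illustrative only: rule precedence (E > D > C > A/B) is an assumption.
    """
    bins_by_species = defaultdict(set)
    species_by_bin = defaultdict(set)
    n_seqs = defaultdict(int)
    for species, bin_id in records:
        bins_by_species[species].add(bin_id)
        species_by_bin[bin_id].add(species)
        n_seqs[species] += 1

    grades = {}
    for species, bins in bins_by_species.items():
        if any(len(species_by_bin[b]) > 1 for b in bins):
            grades[species] = "E"   # more than one species name in the same BIN
        elif n_seqs[species] < 3:
            grades[species] = "D"   # insufficient sequences
        elif len(bins) > 1:
            grades[species] = "C"   # one species split across multiple BINs
        elif n_seqs[species] > 10:
            grades[species] = "A"   # concordant, more than 10 sequences
        else:
            grades[species] = "B"   # concordant, 10 or fewer sequences
    return grades
```

For example, a species with 11 sequences in a single BIN that it shares with no other species would receive grade A under these rules, whereas two species found in the same BIN would both receive grade E.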
BINs assigned to species graded A and B were further tagged RELIABLE as they represent concordant records (i.e., the species is assigned to a single BIN). All grade D records were considered INSUF-UNCERTAIN as the data are insufficient (as we considered less than 5 records per BIN, and BAGS considers BINs with less than 3 records). For grade C morphospecies, a publicly available dataset comprising the involved BINs was created in BOLD (DS-NISEURC) and two analyses were used to determine if all BINs of a given grade C morphospecies were in a monophyletic (morphospecies considered RELIABLE) or non-monophyletic group (considered UNCERTAIN). A neighbor-joining (NJ) Tree was generated using the following criteria: BOLD Aligner and selection of Country/Ocean, and the BIN URI and GenBank Accession boxes (for the remaining, the default parameters were maintained). In the filters, nucleotide sequence length ≥300 bp was selected. A barcode gap analysis was also performed with the same settings (Table S4). This analysis flags the cases where the distance between the species and the nearest neighbor is <2% or lower than the maximum intraspecific distance. Grade E records were compiled into a different dataset (DS-NISEURE) and curated by performing a BIN discordance report on BOLD (no filters were selected) (Table S5). This analysis produces files containing concordant, discordant, and singleton records. While the BIN discordance report is a BIN-centered approach, the BAGS analysis is morphospecies-centered, meaning that in BOLD, a BIN that includes more than one species is considered discordant, while in BAGS, if a morphospecies has one discordant BIN assigned, it is automatically considered a discordant species (grade E), even if the other BINs are concordant. As such, some of the BINs of species graded E by BAGS can be considered concordant in the BIN discordance report. 
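The difference between the BIN-centered and the morphospecies-centered views can be made concrete with a small Python sketch. The function names and the record format (one `(species, BIN)` pair per sequence) are hypothetical, and the code only mimics the three report categories named above; it is not the BOLD or BAGS code.

```python
from collections import defaultdict

def bin_report(records):
    """BIN-centered view: label each BIN as singleton, concordant, or discordant."""
    names = defaultdict(set)
    counts = defaultdict(int)
    for species, bin_id in records:
        names[bin_id].add(species)
        counts[bin_id] += 1
    report = {}
    for bin_id in names:
        if counts[bin_id] == 1:
            report[bin_id] = "singleton"
        elif len(names[bin_id]) > 1:
            report[bin_id] = "discordant"
        else:
            report[bin_id] = "concordant"
    return report

def discordant_species(records):
    """Morphospecies-centered view: species with at least one BIN shared with another name."""
    names = defaultdict(set)
    for species, bin_id in records:
        names[bin_id].add(species)
    return {species for species, bin_id in records if len(names[bin_id]) > 1}

# Hypothetical toy data: "Sp a" occurs in two BINs, one of them shared with "Sp b".
toy = [("Sp a", "BIN:1")] * 3 + [("Sp a", "BIN:2")] * 2 + [("Sp b", "BIN:2"), ("Sp c", "BIN:3")]
print(bin_report(toy))
print(discordant_species(toy))
```

In this toy case "Sp a" counts as a discordant morphospecies because of BIN:2, yet its other BIN (BIN:1) is concordant in the BIN-centered view, which is the situation described at the end of the paragraph above.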
All BINs with ≤ 5 records were considered "Insufficient Records" (INSUF) and tagged UNCERTAIN, as 3 or fewer records (as used on BAGS) were not sufficient to solve certain discordances; BINs with more than 5 records were manually inspected using the following criteria:

1. BINs were tagged SYN if the identifications (IDs) were synonyms of the accepted species name (confirmed using the WoRMS database) and were considered RELIABLE.

2. BINs were tagged AMBIG and UNCERTAIN if:
a. The BIN comprises more than 3 different IDs (i.e., species, genus, incomplete classification);
b. The BIN comprises 3 IDs, but one of the two IDs with the lowest number of records still contains ≥ 10 records.

3. BINs were tagged MISID, and the discordance was solved if:
a. When there are only two IDs, and the number of records of the putative "incorrect ID" is ≤ 5% of all records in the BIN (rounded up to the unit), the species with the highest number of records is considered the correct ID;
b. The ID with the highest number of records is close to other BINs of the same species (when analyzing the NJ tree for the genus), then it is considered the correct ID;
c. The ID with the lowest number of records is from a different genus than the one that is considered the "correct ID", then it is considered the incorrect ID;
d. When analyzing the NJ tree for the genus, the ID with the lowest number of records is in a different cluster, containing only records of that same ID, then it is considered the incorrect ID.
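The count-based parts of these criteria (the ≤ 5 records rule, 2.a, 2.b, and 3.a) lend themselves to a short Python sketch. It is only an illustration under stated assumptions: the function name `tag_bin`, the order in which the rules are checked, and the placeholder labels `CONCORDANT` and `NEEDS_TREE` are not part of the workflow, and the criteria that require a synonym lookup in WoRMS (SYN), a judgment on well-established species (SHARE), or NJ tree inspection (3.b to 3.d) are deliberately left out.

```python
from collections import Counter

def tag_bin(ids):
    """Tag a BIN from its list of identifications (one string per record).

    Returns (tag, final_tag). Only count-based criteria are covered.
    """
    counts = Counter(ids)
    total = len(ids)
    if total <= 5:
        return "INSUF", "UNCERTAIN"
    if len(counts) == 1:
        return "CONCORDANT", "RELIABLE"   # placeholder label, not a tag from the workflow
    ranked = counts.most_common()          # sorted by number of records, descending
    if len(counts) > 3:
        return "AMBIG", "UNCERTAIN"        # criterion 2.a
    if len(counts) == 3 and ranked[1][1] >= 10:
        return "AMBIG", "UNCERTAIN"        # criterion 2.b
    # 5% of all records, rounded up to the unit (integer arithmetic avoids float error)
    limit = (5 * total + 99) // 100
    if len(counts) == 2 and ranked[1][1] <= limit:
        return "MISID", "RELIABLE"         # criterion 3.a: majority ID kept as correct
    return "NEEDS_TREE", None              # placeholder: criteria 3.b-3.d need NJ trees
```

With 40 records, for instance, the 5% limit is 2 records, so a BIN with 38 records of one ID and 2 of another would be tagged MISID under criterion 3.a, whereas a 10-versus-5 split would be deferred to tree-based inspection.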

Figure 1.
Workflow designed for the curation and audit of BAGS-graded BINs. A list of marine non-indigenous invertebrates was submitted to BAGS that produced a library of graded morphospecies. A set of criteria was developed to tag the BINs associated with the morphospecies as SYN (in case of synonyms), SHARE (if the BIN system cannot discriminate between well-established species), MISID (misidentifications), or AMBIG (ambiguous records). Final RELIABLE or UNCERTAIN tags defined whether the morphospecies identification (ID) matched to each BIN was considered reliable or uncertain. * Workflow used for grade E species with more than 5 records in the BIN discordance analysis.
For the analysis of points 3.b, 3.c, and 3.d, NJ trees were generated for the genus, family, or order (depending on the discordance level), to determine the correct ID according to each criterion.
A reference library composed of three DNA sequences for each BIN was constructed using all the RELIABLE BINs. A script was developed to select only the curated BIN records associated with morphospecies from the graded library (resultant from the BAGS analysis), with the curated identification, and then to select three random DNA sequences, prioritized according to size and availability of country data for each record. Nine size classes were created to select the sequences: (1) sequences with 658 bp; (2) ≥650 and <658 bp; (3) ≥625 and <650 bp; (4) ≥600 and <625 bp; (5) ≥575 and <600 bp; (6) ≥550 and <575 bp; (7) ≥525 and <550 bp; (8) ≥500 and <525 bp; (9) <500 or >658 bp. The script developed to create the DNA sequence reference library is available at https://github.com/tadeu95/Curated-BINs-Reference-Library (accessed on 25 January 2023).
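The selection step can be sketched in Python as follows. This is an independent illustration, not the script in the repository cited above: the record format (dictionaries with hypothetical keys `length` and `country`) and the tie-breaking order (size class first, then presence of country data, then random) are assumptions.

```python
import random

def size_class(length):
    """Map a sequence length in bp to one of the nine size classes (1 = best)."""
    if length == 658:
        return 1
    if length > 658:
        return 9
    for lower, cls in [(650, 2), (625, 3), (600, 4), (575, 5), (550, 6), (525, 7), (500, 8)]:
        if length >= lower:
            return cls
    return 9  # shorter than 500 bp

def pick_sequences(records, k=3, seed=None):
    """Pick up to k records for one BIN, best size class first, then country data."""
    rng = random.Random(seed)
    pool = list(records)
    rng.shuffle(pool)  # random order among records with equal priority
    # sort() is stable, so the shuffled order is kept within ties
    pool.sort(key=lambda r: (size_class(r["length"]), r["country"] is None))
    return pool[:k]
```

Because the sort is stable, shuffling first and sorting second yields a random draw among equally ranked records while still honoring the priority order.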
With this curated library, a new dataset was created on the BOLD database using the reference library process IDs (DS-NISEUREF).

Testing the Impact of the Curated Reference Library on the Accuracy and Amount of NIS Detected through DNA Metabarcoding of Natural Communities
High-throughput sequencing data were generated from three types of samples collected in a recreational marina (Costa Nova, Portugal: 40°37′11.6″ N, 8°44′55.7″ W): zooplankton collected from the water column (with a plankton net with 55 µm mesh size), water eDNA (1 L collected at 1 m depth), and a sample of marine invertebrates scraped from hard substrates (e.g., pontoons, cables, buoys), to conduct this analysis (considering only the marine or oligohaline invertebrates detected). Briefly, zooplankton and eDNA samples were vacuum filtered through 0.45 µm, 47 mm nitrocellulose membranes (Millipore, Corp., Bedford, MA, USA) and DNA from half of the filters was extracted using the DNeasy® PowerSoil® Kit (Qiagen, Hilden, Germany) according to manufacturer's instructions. Marine invertebrate samples were preserved in absolute ethanol, and DNA extraction was carried out using the protocol described in [51]. High-throughput sequencing (HTS) was carried out at Genoinseq (Biocant, Cantanhede, Portugal) in an Illumina MiSeq® platform using the primer pair mlCOIintF (5′-GGWACWGGWTGAACWGTWTAYCCYCC-3′) [52] and LoboR1 (5′-AAACYTCWGGRTGWCCRAARAAYCA-3′) [53] to amplify a 313 bp region of the COI gene.
MiSeq data were analyzed on the Multiplex Barcode Research And Visualization Environment (mBRAVE; http://mbrave.net/ (accessed on 15th June 2022)) [54], and three tests were performed, which differed mainly in the reference sequence libraries used to conduct the taxonomic assignments: (1) BOLD system reference libraries, namely SYS-CRLCHORDATA, SYS-CRLINSECTA, SYS-CRLNONARTHINVERT, and SYS-CRLNONINSECTARTH; (2) a dataset containing only records from grade A, B, and C (audited with BAGS); and (3) our curated reference library. mBRAVE settings were as follows: Trimming-Trim front: 0 bp, Trim end: 0 bp, Trim length: 313 bp, Primer masking: off; Filtering-Min QV: 10, Min length: 150 bp, Max bases with low QV: 25%, Max bases with ultralow QV: 25%; No pre-clustering, ID distance threshold: 3%, Exclude from OTU threshold: 3%, Minimum OTU size: 1, OTU threshold: 3%; Paired End Merging: Pool, Assembler min overlap: 20 bp, Assembler max substitutions: 5 bp. This allowed us to compare results on the accuracy and number of identifications using non-curated libraries (BOLD system libraries), an audited library, and a fully curated library. For the analysis of the results using the non-curated library, we eliminated all species that were not marine or brackish invertebrates, records with hits to taxonomic ranks higher than species level (as for NIS, information at the species level is mandatory), and records with hits with more than one species/taxon.
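The three exclusion rules applied to the non-curated library results can be expressed as a simple filter. The sketch below is a hypothetical illustration in Python: the hit format (a dictionary with keys `taxa` and `rank`) and the reference set of marine or brackish invertebrate names are assumptions and do not correspond to an mBRAVE export format.

```python
def keep_hit(hit, marine_invertebrates):
    """Return True if a taxonomic hit passes the three filters described above.

    hit: {"taxa": [names matched by the read], "rank": rank of the match}
    marine_invertebrates: set of accepted marine or brackish invertebrate species names
    """
    taxa = hit["taxa"]
    if len(taxa) != 1:           # hits with more than one species/taxon are discarded
        return False
    if hit["rank"] != "species":  # ranks higher than species level are discarded
        return False
    return taxa[0] in marine_invertebrates  # only marine or brackish invertebrates kept

# Hypothetical usage with a one-species reference set
reference = {"Amphibalanus improvisus"}
hits = [
    {"taxa": ["Amphibalanus improvisus"], "rank": "species"},
    {"taxa": ["Mytilus"], "rank": "genus"},
]
kept = [h for h in hits if keep_hit(h, reference)]
```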

List of Non-Indigenous Invertebrate Marine Species Occurring in Europe
After the removal of duplicated records and records with taxonomic ranks higher than species level, our final list contained 1249 non-indigenous species (Table S1), of which 38% were found in both EASIN and AquaNIS, 21% were retrieved exclusively from AquaNIS, 40% from EASIN and 1% from WRiMS. The species were distributed into 16 phyla, with a dominance of Arthropoda (28%), Mollusca (28%), and Annelida (18%) (Figure 2).

Curation of the Taxonomic Assignments of Barcode Records
Of the 1249 NIS, only 42% (530 species) had COI barcode sequences in BOLD (Table S2), and the most well-represented groups with sequence data were the same that dominated the species list (Arthropoda, Mollusca, and Annelida). For these 530 species, BAGS retrieved 25,291 records belonging to 1105 BINs (Table S3). Only 11.4% of these BINs were assigned to morphospecies graded concordant (A + B), 23% to morphospecies graded C (more than 1 BIN per species, with the species being the exclusive member of the BINs), 9% to morphospecies graded D (insufficient data), and the majority (56.3%) to morphospecies graded E (discordant) (Figure 3). Arthropoda and Mollusca were the most well-represented phyla in all grades (ranging from 20% to 45% of the BINs in each grade). European NIS belonging to Acanthocephala, Chaetognatha, Nemertea, Platyhelminthes, and Porifera do not have records with concordant BINs (A and/or B); however, these BINs only represent 3% of the total number of BINs in the dataset.

All BINs corresponding to grade C morphospecies were considered RELIABLE as all groups were monophyletic (Figure S1). The BIN discordance report that was created for grade E records on BOLD separates the BINs into three categories: singletons (BINs with a single record), concordant (BINs with no taxonomic discordance), and discordant (BINs with taxonomic discordance). For the 622 graded E BINs, this report indicated that 20% were singletons, 31% were concordant, and 49% were discordant (Figure 4). Singletons in graded E BINs are likely the result of private records (records that researchers have deposited on BOLD but are not publicly available); these were immediately tagged UNCERTAIN since they hold insufficient records. Concordant and discordant records were analyzed case by case following the established workflow (Figure 1). Of the 194 concordant BINs, 99 (51%) were considered UNCERTAIN (91 INSUF). Three BINs (BOLD:AAA2185, BOLD:AAA4734, and BOLD:ACQ2249) assigned to species of the Mytilus genus were considered UNCERTAIN as the Mytilus edulis species complex is composed of three closely related species: M. edulis Linné, 1758, M. galloprovincialis Lamarck, 1819 and M. trossulus Gould, 1850 [55]. Of the 302 discordant BINs, 174 were tagged AMBIG, 92 MISID, 31 SYN, and 5 SHARE (Figure 4). Most BINs tagged SYN, SHARE and INSUF belonged to Mollusca (77%, 88%, and 38%, respectively), while Arthropoda and Mollusca dominated the MISID (42% and 30%, respectively) and AMBIG records (28% and 26%, respectively) (Figure 5). After the curation, 216 BINs (35%) were solved and considered RELIABLE. A total of 70% of all AMBIG records were tagged INSUF, which represents 34% of all BINs (378 in the total 1105 BINs) (Figure 4).

Figure 4.
Results from the BIN discordance report performed on the BOLD database. This analysis separates the BINs into three categories: singletons (BINs with only one record), concordant (BINs with no taxonomic discordance), and discordant (BINs with taxonomic discordance). The BINs were then tagged according to the designed workflow into: AMBIG (ambiguous records), INSUF (insufficient records), MISID (misidentifications), SYN (synonyms), or SHARE (if the BIN system was not able to discriminate well-established species).

Impact of the Curated Reference Library on the Accuracy and Amount of NIS Detected through DNA Metabarcoding of Natural Communities
Only RELIABLE BINs were included in the final curated reference library (Table S6). This library included 597 BINs (54% of the initial number) corresponding to 356 species, of which 21% corresponded to grade A + B morphospecies, 43% to grade C, and 36% to morphospecies originally graded E that were deemed RELIABLE after applying the curation workflow (Table S8). If the post-audit library had been assembled only with records graded A, B, and C, it would consist of only 384 BINs (36% fewer BINs than the curated reference library created using our grade E-curation workflow) (Table S7).
Analysis of the high-throughput sequencing reads from the three samples (eDNA, zooplankton, and fouling marine invertebrate samples) against the non-curated libraries resulted in assignments to 29 marine invertebrate species out of a total of 44 species (Table 1). The highest number of NIS was detected using our curated library (16 NIS), followed by the non-curated libraries (12 NIS) and the audited library (9 NIS) (Table 1). Haliclystus tenuis, Scruparia ambigua, Syllidia armata, and Terebella lapidaria were exclusively detected with the non-curated libraries. Upon further investigation, we determined that the records from Haliclystus tenuis and Terebella lapidaria belonged to BINs corresponding to grade D morphospecies (insufficient records), and the BINs of the other two species were not present in the BOLD database (probably private records), rendering these results uncertain detections. Eight NIS were not detected using the non-curated libraries but were present in either one or both the audited and curated libraries. Amphibalanus improvisus, Paracalanus indicus, and Ruditapes philippinarum were exclusively detected with our curated library.
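The percentages reported for the library sizes and for the NIS detections follow directly from the counts given above; the short Python check below reproduces them. The variable names are illustrative, and the relative gain in detections is computed here from the two reported counts (12 and 16 NIS).

```python
# Counts taken from the text above
total_bins = 1105      # BINs retrieved by BAGS
curated_bins = 597     # BINs in the curated reference library
abc_only_bins = 384    # BINs if only grades A, B, and C were kept

share_of_initial = 100 * curated_bins / total_bins                 # about 54%
fewer_bins = 100 * (curated_bins - abc_only_bins) / curated_bins   # about 36%

nis_noncurated, nis_curated = 12, 16
relative_gain = 100 * (nis_curated - nis_noncurated) / nis_noncurated  # about 33%

print(round(share_of_initial), round(fewer_bins), round(relative_gain))
```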

Discussion
By analyzing the availability and quality of the COI records belonging to non-indigenous invertebrate marine species occurring in Europe, our study brings to the forefront six main conclusions: (1) most species still lack genetic information for this marker (58%); (2) the majority of the existent records were graded discordant (56%); (3) data curation can increase the number of concordant records (36%); (4) the use of a curated database can increase the chances of detecting NIS (33%); (5) some NIS can be categorized as possible cryptic species complexes (23%); and (6) many BINs were represented by singletons (i.e., only one sequence record, 20% of all graded E morphospecies BINs).
DNA metabarcoding has proven to be a powerful tool in biomonitoring and species detection [56]. In particular, the ability to detect premature life stages, such as propagules, larvae or small juveniles, or recently introduced species that are still at a low density and confined to a small area, maximizes the early detection of NIS before harming ecosystems [57]. However, the use of DNA metabarcoding, in particular at the taxonomic assignment step, is dependent on the availability of taxonomically accurate sequence data in public databases. In this study, we verified that 58% of the species analyzed did not have publicly available COI sequence data. Gaps of up to 90% in DNA barcodes have been reported earlier for different taxonomic groups of marine invertebrates [29,30,32,58]. These gaps can probably be explained by the difficulties inherent in generating reference libraries for this phylogenetically very diverse set of taxa, for example, in the initial identification of specimens through morphology-based approaches [59] or the lower PCR amplification success rates when using universal primers [53], and also a lower number of studies and commitment to produce reference libraries for marine invertebrates, in comparison to other groups such as fishes or freshwater invertebrates [32]. This was also noticeable in the current study by the significant number of UNCERTAIN records that were tagged "INSUF" due to insufficient records in the database (70%).
Although there are fewer studies aiming at completing and auditing DNA barcode reference libraries of marine invertebrates, recently, there have been some initiatives worth mentioning. During the 8th iBOL conference hackathon in 2019, a group of researchers reviewed 83,712 DNA barcode records belonging to four major taxonomic groups of marine invertebrates (Echinodermata, Mollusca: Bivalvia and Gastropoda, Annelida: Polychaeta and Arthropoda: Crustacea), which resulted in more than 2700 records flagged and removed from BOLD [25]. In addition, the project GEANS aims to verify taxonomically over 90% of the macroinvertebrate species of the North Sea [60]. For this reference library, quality control will be performed by a taxonomic curator. By 2021, this reference library covered over 30% of North Sea species, but morphological identification can be a time-consuming process that becomes more difficult at a wider European scale. We concur that taxonomic expertise is crucial to guarantee accurate identifications of the specimens used to generate the DNA barcodes. In contrast, for records flagged as discordant that are already deposited in genetic databases, a taxonomic congruency check, such as the one performed in the current study, may be the most straightforward and fast approach, or even the only one available for records lacking vouchers [25]. However, in the cases where the discordances are not solvable (65% of grade E BINs in our study), the taxonomic revision of the original specimens (if stored properly, for which repositories remain crucial) is the only solution. For instance, Ciona intestinalis and C. robusta (Chordata: Tunicata) were recently reclassified since there were many sequences of C. robusta on GenBank falsely attributed to C. intestinalis [61,62]. In addition, sequences of Botrylloides diegensis (Chordata: Tunicata) (a global marine invader) were found to be erroneously assigned to B. leachii (a putatively native species in Europe) [63], and sequences from Acartia tonsa (Arthropoda: Crustacea), an invasive species recorded in many ecoregions of the world, including in several European countries, formed several distinct clades, some of which clustered with A. hudsonica [64].
Our analysis highlighted that, in addition to the lack of COI barcode data for these species, the majority of the BINs were graded discordant. This has also been found for Macaronesian non-indigenous marine invertebrates, where approximately 50% of the species were graded discordant [31], which can represent a major constraint for the use of DNA metabarcoding in studies targeting marine invertebrate communities. In the particular case of NIS, species misidentification can be even more problematic since it can trigger either action or inaction when not required.
Comprehensive and curated reference libraries are imperative for accurate DNA-based taxonomic identifications [65]. However, a standard procedure for the curation of BINs is still lacking. Guidelines for the curation of specimen vouchers and barcode data and metadata, as recently published by Rimet and collaborators [35], are an important component of quality control, but a further taxonomy congruency check is still necessary at the end of the reference library production chain [66]. With our curation workflow based on an automated system (BAGS), we first detected the discordant records, and by inspecting each, we were able to increase the number of concordant records using a conservative approach. We also strived for a simple workflow that can be easily employed, at least for moderately sized sets of species. However, our curation protocol also has its own limitations because it is based only on the congruence between morphospecies and BINs. Moreover, for this purpose, we considered it important to use all data available, and the workflow did not evaluate the associated metadata. In addition, records that were previously considered reliable or morphospecies concordant can occasionally be considered uncertain later, or vice-versa, due to the frequent update of BIN information and taxonomic revision. This is also one of the reasons why, in case of doubt or lack of representativity, we opted for a conservative approach in the current study, where most curated BINs had explicit ambiguities of simple resolution. In addition, as previously mentioned, the fact that this workflow involves manual curation (albeit initially supported by BAGS) makes it challenging for bigger datasets. However, we believe that the criteria can be automated, which would facilitate the work of researchers, particularly whenever databases are updated. Finally, even if some criteria do not guarantee the most accurate identification (indeed, some records remained uncertain and unsolvable, at least until more data becomes available), we still believe that this curation workflow is a quick and reliable approach, as some records can, in fact, be considered reliable and definitely solved (e.g., records tagged SYN, MISID).
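To illustrate the kind of congruence check between morphospecies names and BINs on which this workflow relies, a minimal Python sketch follows. It is a simplified illustration and not the BAGS implementation: the function name, the three status labels, and the placeholder species and BIN identifiers are all assumptions made for the example, and record counts, grades, and annotation tags are ignored.

```python
from collections import defaultdict


def congruence_by_species(records):
    """Classify each morphospecies by how its name maps onto BINs.

    records: iterable of (species_name, bin_id) pairs, one per barcode record.
    The labels returned are hypothetical and simplified for illustration.
    """
    bins_of = defaultdict(set)      # species -> BINs holding its records
    species_in = defaultdict(set)   # BIN -> species names found in it
    for species, bin_id in records:
        bins_of[species].add(bin_id)
        species_in[bin_id].add(species)

    status = {}
    for species, bins in bins_of.items():
        if any(len(species_in[b]) > 1 for b in bins):
            # At least one BIN is shared with another species name.
            status[species] = "discordant"
        elif len(bins) > 1:
            # Several BINs, each exclusive to this species.
            status[species] = "multiple_bins"
        else:
            # One BIN, exclusive to this species.
            status[species] = "concordant"
    return status


# Toy records with placeholder names (not data from this study).
records = [
    ("Species a", "BIN-1"),
    ("Species a", "BIN-1"),
    ("Species b", "BIN-2"),
    ("Species b", "BIN-3"),
    ("Species c", "BIN-4"),
    ("Species d", "BIN-4"),
]
status = congruence_by_species(records)
```

In this toy input, only the two names sharing "BIN-4" would need manual inspection, which is the step where tags such as SYN or MISID would be assigned by a curator.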
In our study, we also detected almost twice the number of BINs compared to the number of morphospecies, flagging the potential presence of hidden diversity for several NIS occurring in European waters. Indeed, species complexes have been uncovered for many invasive species (e.g., Bryozoa: Bugula neritina, Watersipora sp. [67,68]) and marine invertebrates in general [69][70][71]. This highlights that reference libraries should cover a balanced representation of specimens across each species' distributional range, including native and recipient locations, to account for possible genetic variability among different geographic regions. In addition, we found a large fraction of NIS represented by singletons, preventing their use as reliable records and the detection of possible intraspecific variability or unrecognized diversity.
As a result of our analysis, we were able to assemble a reliable reference library for NIS using only concordant records (597 BINs) belonging to 356 species with confirmed occurrence in Europe. When the trial metabarcoding datasets were matched against this curated library, the number of NIS detected increased by a third. Furthermore, the comparative tests using the non-curated library, an audited library, and the final curated reference library revealed the possible existence of four false detections. Two of the species that were exclusively detected using the non-curated libraries were associated with BINs that were not publicly available on BOLD. In our opinion, these could be either private records, which can represent a major issue since researchers cannot verify the detailed information and thus cannot audit or guarantee its credibility, or records that were removed from BOLD in the meantime but not immediately updated in the BOLD-imported databases on mBRAVE. Weigand and collaborators [32] showed that for some groups, such as Annelida and Sipuncula, the number of private records was even higher than that of public records. The other two species detected using the non-curated libraries had insufficient records in the database, which renders them uncertain under the standards defined for this study. Without proper curation, these four "false detections" could lead to false conclusions during a biomonitoring study, which highlights the importance of proper curation of taxonomic assignments of barcode records. The use of a curated dataset also revealed four NIS that were not initially detected using a non-curated library. Testing our metabarcoding reads against these three libraries showed that data curation is crucial to improve species detection, as previously reported [72]. Importantly, it also revealed that effort should be invested in curating discordant data rather than simply removing it from the datasets, as the audited library recovered the lowest number of NIS.
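The comparison between libraries described above amounts to simple set operations on the lists of species detected with each library. The following Python sketch shows the idea under stated assumptions: the function name and the placeholder species labels are hypothetical, and the toy sets do not reproduce the detections obtained in this study.

```python
def compare_detections(non_curated, curated):
    """Split detected species by the reference library that recovered them.

    Both arguments are sets of species names detected when the same reads
    are matched against each library.
    """
    return {
        # Detected with both libraries.
        "shared": non_curated & curated,
        # Candidates for false detections: lost after curation.
        "only_non_curated": non_curated - curated,
        # Detections gained through curation of discordant records.
        "only_curated": curated - non_curated,
    }


# Toy detections with placeholder labels (not results from this study).
non_curated_hits = {"sp1", "sp2", "sp3"}
curated_hits = {"sp2", "sp3", "sp4"}
summary = compare_detections(non_curated_hits, curated_hits)
```

Species falling in "only_non_curated" are the ones worth tracing back to their BINs, for example to check whether the underlying records are private or too few to be considered reliable.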

Conclusions
Although filling the gaps in reference libraries is essential for making the most of the potential of DNA metabarcoding in NIS surveillance in European marine and coastal ecosystems [23,33], a careful compilation, verification, and annotation of available sequences is also crucial to support rigorous species identifications, as demonstrated in our study. This can have major implications, as introduced species can be misidentified as putative native species or vice-versa when employing DNA-based tools and conducting taxonomic assignments against non-curated databases. Unfortunately, and as also flagged by our study, these database errors are frequent, and thus, auditing existing records and building/compiling curated reference libraries with reliable taxonomic assignments is as important as generating new reference sequences for NIS.
Supplementary Materials: The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/d15020174/s1, Figure S1: Neighbor-joining (NJ) tree generated on BOLD using all the records from grade C morphospecies, Table S1: List of the non-indigenous invertebrate marine species occurring in Europe retrieved in September of 2021 from the databases EASIN, AquaNIS, and WRiMS, Table S2: List of the non-indigenous invertebrate marine species occurring in Europe without available COI barcode data in the BOLD database, Table S3: Graded library generated by BAGS using the list of the non-indigenous invertebrate marine species occurring in Europe containing available COI barcode data, Table S4: Barcode gap analysis performed on BOLD for grade C morphospecies using the dataset DS-NISEURC, Table S5: Analysis of graded E records and assignment of TAGS using the BIN discordance report performed on BOLD with the dataset of grade E records (DS-NISEURE), Table S6: List of curated RELIABLE BINs, Table S7: Audited library composed of reference sequences for each BIN (using BINs graded A, B, and C by BAGS), compiled using the developed script, and Table S8: Curated library composed of reference sequences for each RELIABLE BIN, compiled using the developed script.

Data Availability Statement:
The script developed to create the DNA sequence reference library is available at https://github.com/tadeu95/Curated-BINs-Reference-Library (accessed on 25 January 2023). Metabarcoding datasets will be made available upon request since they belong to a study that is not published yet.