Next-Generation DNA Barcoding for Fish Identiﬁcation Using High-Throughput Sequencing in Tai Lake, China

: Tai Lake, an important biodiversity hotspot of the lower reaches of the Yangtze River in China, possesses its characteristic ﬁsh fauna. Barcoding on native species is important for species identiﬁcation and biodiversity assessment with molecular-based methods, such as environmental DNA (eDNA) metabarcoding. Here, DNA-barcoding coupled with high-throughput sequencing (HTS) and traditional Sanger sequencing was introduced to barcoding 180 specimens belonging to 33 prior morphological species, including the most majority of ﬁsh fauna in Tai Lake. HTS technology, on the one hand, signiﬁcantly enhances the capture of barcode sequences of ﬁsh. The successful rate of ﬁsh barcoding was 74% and 91% in Sanger and HTS, respectively. On the other hand, the HTS output has a large number (64%) of insertions and deletions, which require strict bioinformatics processing to ensure that the “true” barcode sequence is captured. Cross-contamination and parasites were the primary error sources that compromised attempts at the DNA barcoding of ﬁsh species. The barcode gap analysis was 100% successful at delimiting species in all specimens. The automatic barcode gap discovery (ABGD) method grouped barcode sequences into 34 OTUs, and some deep divergence and closed species failed to obtain corresponding OTUs. Overall, the local species barcode library established by HTS barcoding here is anticipated to shed new light on conserving ﬁsh diversity in Tai Lake.


Introduction
It is well recognized that the loss of biodiversity caused by environmental deterioration has adverse ecological consequences [1], for example, nutrient over-enrichment in lakes leads to the simplification of flora and fauna [2].Biodiversity conservation is a usual tool for sustaining the health of an ecosystem [3].Species delimitation, especially identifying functional species, is the first step to understanding the relationship between biodiversity and ecological services [4][5][6].
A conventional taxonomist uses the morphological method to describe species, and misidentification often occurs because of phenotypic plasticity and differing life stages [7,8].In addition, traditional fish surveys generally involve capturing organisms, are invasive for the biological community under study, and conflict with the original intention of biodiversity conservation.The DNA barcode, usually a short and standardized sequence, has emerged as a cost-effective proxy for species identification [9,10], which significantly improves the efficiency of biomonitoring [11,12].Incomplete reference databases of the DNA barcode, especially for native species, are widely recognized as a major obstacle to the use of molecular-based methods.
The barcode database that was built needed a PCR procedure firstly, which amplified not only target region but also non-intended fragments.Subsequently, Sanger sequencing was used to obtain the target barcode sequence.Although Sanger sequencing is viral and is low-cost, it can only provide a single sequence for each specimen.The single sequence from Sanger sequencing can be the PCR product of the co-amplification of contaminated DNA and may not represent the 'true' barcode sequence, which increases the risk of failure [13].
High-throughput sequencing (HTS) allows for sequencing millions of DNA fragments in parallel and provides an opportunity to reveal the sequence composition of the PCR product [14].In addition, HTS allows for the generation of multiple sequences for a single specimen and provides an opportunity to identify the contamination.HTS barcoding not only enhances the accuracy of specimen identification but also accelerates the process of barcode capturing [14,15].The Ion Torrent sequencing platform, in particular, can provide tens of millions of sequences within 10 h for the barcode testing of hundreds of specimens [16,17].It dramatically improves the efficiency of database construction and reduces the cost.In this study, we established a DNA barcoding library of fish from Tai Lake using HTS and analyzed the cytochrome c oxidase I (COI) barcode gap among fish.The significance of the library was to provide a database for fishery resource surveillance and promote the taking of more careful measures in the conservation of fishery biodiversity.

Sampling
Tai Lake, located in the lower reaches of the Yangtze River, is the third largest freshwater lake and harbors a large number of freshwater fish species (Figure 1).Algae bloom caused by eutrophication significantly threatens the lake's ecological health [2] and has led to a water crisis [18].There were approximately 107 fish species belonging to 14 orders, 25 families, and 73 genera in Tai Lake historically, of which Cyprinidae accounted for more than 60% [19].The diversity of fish fauna in Tai Lake decreased dramatically, and no more than 40 fish species were found in traditional fishery resource surveillance conducted by a local fishery institute in recent years [20].In this study, 7 sampling sites were arranged, and a total of 180 specimens belonging to 34 species were collected with gill nets and ground cages.There were 130 specimens of Cyprinidae and 50 specimens representing another ten families [19] (Figure 2).

DNA Isolation and PCR Amplification
The fish specimens were identified by an empirical taxonomist and were preserved in 95% ethanol.About 0.5 mg tail fin tissue samples were collected, and the DNA was extracted using the QIAGEN DNeasy Blood and Tissue Kit (QIAGEN, Hilden, Germany) according to the manufacturer's protocol.PCR amplification was performed in a final volume of 50 µL, made up of 1 µL of 10 µM of universal forward (GGWACWG-GWTGAACWGTWTAYCCYCC) and reverse (TAAACTTCAGGGTGACCAAARAAYCA) primers [9], 37.8 µL of ultrapure water, 5 µL of 10 × PCR High Fidelity PCR buffer, 2 µL of MgSO 4 (50 mM), 1 µL of dNTP mix (10 mM), 0.2 µL of Platinum Taq DNA polymerase, and 2 µL of DNA template (Invitrogen, Waltham, MA, USA, USA).Amplification was performed using a "Touchdown" procedure.PCR conditions were 95 • C for 5 min, 16 cycles of 95 • C for 10 s, 62 • C for 30 s (−1 • C per cycle), and 72 • C for 60 s, followed by 20 cycles of 95 • C for 10 s, 46 • C for 30 s, and 72 • C for 60 s.Negative control was included in the experiment.PCR products were detected by 2% agarose gel.

Sequencing
Each amplicon was indexed with unique 10-mer multiple identifiers (MIDs).The PCR amplicons were pooled in equal volumes.The purified library was constructed using the Ion Torrent PGM template OT2 400 kit (Life Technologies, Carlsbad, CA, USA) and subsequently sequenced on the "318 V2" chip according to the protocols.At least 100 sequences were provided for each sample.The sequence of each amplicon was also obtained using Sanger sequencing (ABI 3730XL sequencer, Applied Systems Inc., USA) side by side.S1).S1).

Bioinformatics
All HTS reads were filtered for quality and length and reverse-complemented by the "Biostrings" package of R language [21].Secondly, we programmed a pipeline using R language (involving the "DECIPHER", "seqinr", and "ShortRead" packages) to trim MIDs and forward and reverse primers and to assign the filtered reads to each specimen [22][23][24].The barcode region of each sequence was retained for further processing.
De-replication was conducted by USEARCH [25], and the number of unique sequences each barcode contained, namely, its frequency, was counted simultaneously.Evidently, each specimen had more than one unique sequence owing to contamination and sequence error.Contamination was removed using a genetic distance method.The K2P (Kimura 2-parameter) distance of unique sequences in each specimen was calculated by the "seqinr" package and plotted by nMDS (non-metric multidimensional scaling) [23,26,27].More than one cluster was observed if containment existed in this specimen.Among all the clusters, the biggest one was retained as a containment-free unique sequence, and other clusters were considered as containment and identified by BLAST against GenBank [28].
Water 2023, 15, x FOR PEER REVIEW 4 of 14 Figure 2. Historical list of fish species in Tai Lake, and specimens captured in this study.There were 107 species, historically belonging to 1 class, 14 orders, and 25 families; cyprinidae fish accounted for more than 60 percent of the total species.In total, 180 specimens were captured, belonging to 34 species and 6 orders in this study; among these specimens, cyprinidae also was the most dominant family.

DNA Isolation and PCR Amplification
The fish specimens were identified by an empirical taxonomist and were preserved in 95% ethanol.About 0.5 mg tail fin tissue samples were collected, and the DNA were extracted using the QIAGEN DNeasy Blood and Tissue Kit (QIAGEN, Hilden, Germany) according to manufacturer's protocol.PCR amplification was performed in a final volume of 50 μL, made up of 1 μL of 10 μM of universal forward (GGWACWGG-WTGAACWGTWTAYCCYCC) and reverse (TAAACTTCAGGGTGACCAAARAAYCA) primers [9], 37.8 μL of ultrapure water, 5 μL of 10 × PCR High Fidelity PCR buffer, 2 μL of MgSO4 (50 mM), 1 μL of dNTP mix (10 mM), 0.2 μL of Platinum Taq DNA polymerase, and 2 μL of DNA template (Invitrogen, Waltham, MA, USA, USA).Amplification was performed using a "Touchdown" procedure.PCR conditions were 95 °C for 5 min, 16 cycles of 95 °C for 10 s, 62 °C for 30 s (−1 °C per cycle), and 72 °C for 60 s, followed by 20 cycles of 95 °C for 10 s, 46 °C for 30 s, and 72 °C for 60 s.A negative control was included in the experiment.PCR products were detected by 2% agarose gel.

Sequencing
Each amplicon was indexed with unique 10-mer multiple identifiers (MIDs).The PCR amplicons were pooled in equal volume.The purified library was constructed using the Ion Torrent PGM template OT2 400 kit (Life Technologies, Carlsbad, CA, USA) and subsequently sequenced on the "318 V2" chip according to the protocols.At least 100 sequences were provided for each sample.The sequence of each amplicon was also obtained using Sanger sequencing (ABI 3730XL sequencer, Applied Systems Inc., USA) side by side.Historical list of fish species in Tai Lake, and specimens captured in this study.There were 107 species, historically belonging to 1 class, 14 orders, and 25 families; cyprinidae fish accounted for more than 60 percent of the total species.In total, 180 specimens were captured, belonging to 34 species and 6 orders in this study; among these specimens, cyprinidae also was the most dominant family.
However, there was more than one containment-free unique sequence owing to substitution error.Subsequently, containment-free unique sequences were sorted at the frequency by USEARCH software [25], and one single highest frequency unique sequence was selected as the "true" barcode; if a medium-high frequency unique sequence was detected, a Numt sequence was suspected [29] and validated by comparison to GenBank; other verylow-frequency unique sequences were considered as substitution errors and validated by BLAST against the selected "true" barcode database.Through validation, substitution error unique sequences were generated by substituting one or two bases of a "true" barcode [30].

Phylogenetic Analysis
The effectiveness of the DNA barcode was validated by two exact opposite procedures, including (1) using the DNA barcode to discriminate between species or to check if the barcode gap exists; (2) successfully predicting unexplored specimens by grouping sequences into OTUs (operational taxonomic units).
The genetic K2P distance-based DNA barcoding method was used to discriminate between species [26].To visually observe distance within species, two-dimensional nMDS was plotted using a "vegan" package [27].A comparison between the maximum distance within species and the minimum distance between species in each specimen was performed using the "spider" package to examine the barcode gap [31].K2P and p-distance-based NJ (neighbor-joining) trees were implemented in MEGA 6.0 using 1000 bootstrap replicates [32].
Barcode sequences were further clustered into OTUs for the purpose of predicting unexplored specimens.The ABGD (automatic barcode gap discovery) approach on a web interface (www.abi.snvjussieu.fr/public/abgd,accessed on 26 November 2019) was applied using the default parameter (X = 1; K2P button was chosen).This approach is a model for the purpose of seeking the threshold to cluster sequences into hypothetical species [33].

Results
A total of 149,892 raw reads were generated by HTS (Figure 3).The raw reads ranged in length from 19 to 420 bp.After filtering quality and trimming MIDs and primers, 108,689 sequences remained, and an average of 603 sequences was assigned to each specimen.Following indel, contamination, and substitution removal, a total of 34,733 error-free reads remained.More than 64% of the error reads were generated by indel, and fewer than 500 reads were contamination reads, including fish cross-contamination and parasites.In each specimen, PGM HTS sequencing errors were proportional to sequencing reads (Figure 4).Owing to the degeneration of primers, some specimens had minor raw reads.Generally, at least 100 error-free reads could successfully be obtained by PGM sequencing per specimen (Figure 4).However, if fish tissues were contaminated by parasites, this would be judged a failure to get the "true" barcode.A total of 9% and 26% of specimens could not obtain "true" barcode sequences with HTS and Sanger sequencing, respectively (Figure 4).Finally, 163 "true" barcode sequences were recovered by PGM sequencing.All barcodes were of a full 313 bp length (Figure 3).Each specimen had only one unique sequence that remained, meaning that an average of 231 error-free reads was assigned to each "true" barcode.
Water 2023, 15, x FOR PEER REVIEW 6 of 14 reads (Figure 4).Owing to the degeneration of primers, some specimens had minor raw reads.Generally, at least 100 error-free reads could successfully be obtained by PGM sequencing per specimen (Figure 4).However, if fish tissues were contaminated by parasites, this would be judged a failure to obtaining the ''true'' barcode.A total of 9% and 26% of specimens could not obtain ''true'' barcode sequences with HTS and Sanger sequencing, respectively (Figure 4).Finally, 163 "true" barcode sequences recovered by PGM sequencing.All barcodes were of a full 313 bp length (Figure 3).Each specimen had only one unique sequence that remained, meaning that an average of 231 error free reads was assigned to each "true" barcode.All 163 barcodes recovered by HTS belonged to 33 of 34 a priori morphologica identified species, whereas the Tridentiger bifasciatus species could not be recovered.It w expected that K2P-based genetic variation hierarchically increased from within spec (mean = 0.3%, SE = 0.03), to within genera (mean = 3.82%, SE = 0.23), and to within famil (mean = 17.40%,SE = 0.09) (Table 1).Using nMDS to reduce the dimension of genetic d tances within species, specimens within species were clustered respectively in the tw dimensional plot.Cypriniformes was separated from other orders.Of specimens in t Cypriniformes order, the Acheilognathus family was separated from other families (F ure 5).Overall, a comparison between the maximum distance intra-species and the mi imum distance inter-species demonstrated that a barcode gap existed in all of the analyz specimens (Figure 6).
The phylogenetic tree based on the K2P distance contained 33 species clusters.T number of OTUs produced by the ABGD method was manifested by a red circle outsi the NJ tree (Figure 7).The ABGD analysis produced nine initial partitions.The number groups and the p distance were in the range of 32 to 49 and of 0.059948 to 0.001000, r spectively.The result of 34 OTUs (p distance = 0.012915) was chosen to set the thresho to delimit species boundaries since it was concordant with the outcome of NJ analysis.All 163 barcodes recovered by HTS belonged to 33 of 34 priori morphologically identified species, whereas the Tridentiger bifasciatus species could not be recovered.It was expected that K2P-based genetic variation hierarchically increased from within species (mean = 0.3%, SE = 0.03), to within genera (mean = 3.82%, SE = 0.23), and to within families (mean = 17.40%,SE = 0.09) (Table 1).Using nMDS to reduce the dimension of genetic distances within species, specimens within species were clustered respectively in the two dimensional plot.Cypriniformes was separated from other orders.Of specimens in the Cypriniformes order, the Acheilognathus family was separated from other families (Figure 5).Overall, a comparison between the maximum distance intra-species and the minimum distance inter-species demonstrated that a barcode gap existed in all analyzed specimens (Figure 6).The maximum distance within species The minimum distance between species Figure 6.The maximum distance within species in each specimen compared with the minimum distance between species in each specimen.All specimens fall above the 1:1 line, indicating the existence of a barcode gap.
The phylogenetic tree based on the K2P distance contained 33 species clusters.The number of OTUs produced by the ABGD method was manifested by a red circle outside the NJ tree (Figure 7).The ABGD analysis produced nine initial partitions.The number of groups and the p distance were 32 to 49 and 0.059948 to 0.001000, respectively.The result of 34 OTUs (p distance = 0.012915) was chosen to set the threshold to delimit species boundaries since it was concordant with the outcome of NJ analysis.Of these 34 OTUs, two species, Misgurnus anguillicaudatus and Monopterus albus, were delimited to two theoretical species, respectively, whereas Megalobrama amblycephala and Megalobrama skolkovii were clustered into one candidate species.Other species boundaries represented by OTUs were concordant with morphologically identified species.

Discussion
Some major questions in ecology, such as what constitutes the dietary range o species and the assembly of ecological communities, are hampered by tradition phological identification, owing to laborious work [34,35].DNA barcoding, inte ecological, morphological, and generic data, is anticipated to bring the renaissance onomy [36].The barcode library of fish species established here not only provided fective tool in identifying fish communities in Tai Lake [37] but also accurately me the dietary range of some functional fish species.

Discussion
Some major questions in ecology, such as what constitutes the dietary range of a fish species and the assembly of ecological communities, are hampered by traditional morphological identification, owing to laborious work [34,35].DNA barcoding, integrating ecological, morphological, and genetic data, is anticipated to bring the renaissance of taxonomy [36].The barcode library of fish species established here not only provided an effective tool for identifying fish communities in Tai Lake [37] but also accurately measured the dietary range of some functional fish species.
Sanger sequencing is the dominant approach for obtaining barcode sequences and has been applied to establish a wide range of barcode libraries, from phytoplankton to vertebrates [11,[37][38][39][40].However, no-amplification and co-amplification of non-target sequences usually occur and decrease the efficiency and accuracy of the capture of the "true" barcode [41].In this study, 26% of the analyzed specimens could not be sequenced by Sanger sequencing.When the HTS approach was used instead, the failure rate was reduced to 9%.Because "touch-down" PCR was performed, the HTS approach could obtain sequences from some specimens with low concentrations of amplifications where there was no measurable fluorescence detected by electropherogram [42].The HTS approach increased the sensitivity of the capture of the DNA barcode.A previous study where an average of 143 sequences per specimen were generated by the HTS approach provided the proof for our research [15].
The Ion torrent PGM sequencer was chosen owing to more time-saving in comparison with other sequencers, such as Illumina HiSeq and 454 Junior [17].However, every sequencer introduces errors in the read results.Because of the Ion Torrent based on pH flow call technology, the indel errors occur at very high frequencies when sequencing in massively parallel [43].A SEAME tool was used to remove the error reads in bioinformatics [44,45].We chose an irreproachable COI fragment sequenced by the Sanger approach in a bidirectional way as a gold template for BLAST, with all reads generated by the PGM [28].Indel reads accounted for 64% of the total reads.In a previous study, the error rate was as high as 90%.The difference in the error rate may be due to the different uses of the sequencing kit or the chip density [43].
The biggest advantage of the HTS is the successful acquisition of hundreds of reads in parallel [14], allowing us to thoroughly understand the sequence composition in a single specimen, similarly to multiple clones in Sanger sequencing [46].Previous studies have found that the "true" barcode was often confused with endosymbiotic bacteria (e.g., Wolbachia) [47], cross-contamination [15], and heteroplasmy [41].In this study, one instance of parasite infection was detected in the species of cyprinidae.To our knowledge, parasites were first found as an error source that compromised attempts at the DNA barcoding of fish species.Cross-contamination was detected in all of the analyzed specimens.There are no intended fish infections might be due to contact within specimens in a single fish net, which traditional fishery inventory often used [48].The Sanger-based barcoding method could not discriminate between the cross-contaminations unless enough clones were involved before sequencing.This was laborious work, which the HTS could have easily solved.In the process of bioinformatics, no medium-high frequency unique was detected that is, no heteroplasmy was detected in all specimens.In addition, the HTS process eliminates the need for post-clone sequencing, simplifies the procedure for obtaining barcodes, and reduces the cost of library construction by at least 30%.This is the first comprehensive molecular assessment of the fish species in Tai Lake, including the most majority of known species up to now.The criteria for examining the discrimination power of DNA barcoding is based on comparing intraspecific and interspecific genetic distances [39], which were in the range of 0% to 9.35% and 0.32% to 23.9%, respectively, in this study.Here, the barcode gap analysis was 100% successful in all of the analyzed specimens.In previous studies, the species discrimination rate ranged from 88% to 95% [49][50][51][52].The inability to differentiate between some instances may be due to cryptic species, which would lead to the incongruence between barcode and morphological identification [53]; another reason may be the haplotype sharing that caused by hybridization among species [37].In any case, the 100% discrimination rate in this study demonstrated the perfect congruence between barcode sequences and adequate taxonomy.
The mean value of the intraspecific K2P distance of 0.3% (SE = 0.03) calculated here has also been shown for fish populations in the Nujiang River in Southwest China [37] and is in accordance with the value of below 1% calculated when COI was used as a barcoding maker [54,55].Of the 33 species analyzed here, some extremely intraspecific distances with the highest value of 9.35% were found in the two species Misgurnus anguillicaudatus, and Monopterus albus and displayed deep divergence in the NJ tree.
The barcode library is established for the purpose of predicting unexplored fish specimens when captured from the Tai Lake again.Some relative prediction models are designed for delimiting species based on the similarity of the barcode sequence.The "10-fold rule" that all barcode sequences differing at a species level by the 10 fold of the average value of intraspecific distance was introduced as a standard threshold [56].In this study, the average K2P distance between congeneric species was 13-fold that of the overall intraspecific distance.The threshold calculated here was lower than those reported for other fish barcoding studies, where the value ranged from 15-fold to 70-fold [38,39,51,57].
ABGD is another important prediction model for the first set of species hypotheses performed on the web interface [33].The ABGD method grouped barcode sequences into 34 OTUs.It was well known that the two species Misgurnus anguillicaudatus and Monopterus albus displayed deep divergence in the NJ tree.In contrast, the extremely closed distance between the other two species Megalobrama amblycephala and Megalobrama skolkovii led to the generation of only one OTU.This was due to the asymmetry of divergent COI sequences [56].Moreover, the difference between the two Megalobrama species is only one base of the 313 bp COI sequences.A character-based identification method performed by the BLOG model could be used to delimit them [58].Overall, when the unexplored specimen is identified, if ABGD is involved in grouping sequences into candidate species, another taxonomic approach should be complemented for the purpose of obtaining a 100% success rate in fish species identification [12,59].

Conclusions
With a strict bioinformatics process, high-throughput sequencing (HTS) can significantly improve the capture of barcode sequences of fish species.The successful rate of fish barcoding was 74% and 91% in Sanger and HTS, respectively.Cross-contamination and parasites were the primary error sources that compromised attempts at the DNA barcoding of fish species.Overall, the local species barcode library established by HTS barcoding here is anticipated to shed new light on the conservation of fish diversity in Tai Lake.

Figure 1 .
Figure 1.Map of sampling sites in this study.Location details and specimens collected per site ar provided in Supplementary Materials (TableS1).

Figure 1 .
Figure 1.Map of sampling sites in this study.Location details and specimens collected per site are provided in Supplementary Materials (TableS1).

Figure 2 .
Figure 2. Historical list of fish species in Tai Lake, and specimens captured in this study.There were 107 species, historically belonging to 1 class, 14 orders, and 25 families; cyprinidae fish accounted for more than 60 percent of the total species.In total, 180 specimens were captured, belonging to 34 species and 6 orders in this study; among these specimens, cyprinidae also was the most dominant family.

Figure 3 .
Figure 3.Comparison of barcode sequences obtained by Sanger sequencing and HTS.In horizontal barplot, the grey bar represents the number of specimens captured from Tai Lake, blue bars represent the number of specimens successful barcoded by Sanger sequencing (up) and HTS (down), and orange bars represent the number of specimens failed to barcoded by Sanger sequencing (up) and HTS (down).In histogram, each step in bioinformatics is depicted.

Figure 3 .
Figure 3.Comparison of barcode sequences obtained by Sanger sequencing and HTS.In the horizontal barplot, the grey bar represents the number of specimens captured from Tai Lake, blue bars represent the number of specimens successfully barcoded by Sanger sequencing (up) and HTS (down), and orange bars represent the number of specimens failed to barcoded by Sanger sequencing (up) and HTS (down).In histogram, each step in bioinformatics is depicted.

Figure 4 .
Figure 4.The scatter plot demonstrates the relations between number of barcode regions in ea specimen and number of error-free reads finally retained.The red circle represents the specim infected by parasites.

Figure 4 .
Figure 4.The scatter plot demonstrates the relations between number of barcode regions in each specimen and the number of error-free reads finally retained.The red circle represents the specimen infected by parasites.

Figure 6 .
Figure 6.The maximum distance within species in each specimen compared with the minimum distance between species in each specimen.All specimens fall above the 1:1 line, indicating the existence of a barcode gap.

Figure 7 .
Figure 7. Neighbor joining diagram (NJ tree) of all analyzed specimens.The calculation of measurement is based on K2P method.Outside the tree, red curve fragments represen grouped by ABGD method; all 34 red curve fragments are in accordance with their prior m logical species.

Figure 7 .
Figure 7. Neighbor joining diagram (NJ tree) of all analyzed specimens.The calculation of distance measurement is based on K2P method.Outside the tree, red curve fragments represent OTUs grouped by ABGD method; all 34 red curve fragments are in accordance with their prior morphological species.

Table 1 .
Depiction of genetic divergences sorted by taxonomic level.

Table 1 .
Depiction of genetic divergences sorted by taxonomic level.