Assessment of ITS2 Region Relevance for Taxa Discrimination and Phylogenetic Inference among Pinaceae

The internal transcribed spacer 2 (ITS2) is one of the best-known universal DNA barcode regions. This short nuclear region is commonly used not only to discriminate taxa, but also to reconstruct phylogenetic relationships. However, the efficiency of using ITS2 in these applications depends on many factors, including the family under study. Pinaceae represents the largest family of extant gymnosperms, with many species of great ecological, economic, and medical importance. Moreover, many members of this family are rare, protected, or endangered species. A simple DNA-based method for the identification of Pinaceae species is necessary for their effective protection, for the authentication of products containing Pinaceae representatives, and for phylogenetic inference. Here, for the first time, we conducted a comprehensive assessment of the legitimacy of using the ITS2 region for these purposes. A total of 368 sequences representing 71 closely and distantly related taxa of the seven genera and three subfamilies of Pinaceae were characterized for genetic variability and divergence. Intra- and interspecies distances of ITS2 sequences as well as rates of sequence identification and taxa discrimination among Pinaceae at various taxonomic levels, i.e., the species complex, genus, subfamily, and family, were also determined. Our study provides a critical assessment of the suitability of the ITS2 nuclear DNA region for taxa discrimination among Pinaceae. The obtained results clearly show that its usefulness for this purpose is limited.


Introduction
The Pinaceae family is the largest extant family of all gymnosperms [1]. Within this family, 225 species have been distinguished, which are grouped in three subfamilies (Pinoideae Pilg., Laricoideae Melchior et Werdermann, Abietoideae Pilg.) and eleven genera, i.e., Abies Mill., Cathaya Chun & Kuang, Cedrus Belon ex Trew, Keteleeria Carriere, Larix Mill., Nothotsuga Hu ex C. N. Page, Picea A. Dietr., Pinus L., Pseudolarix Gordon, Pseudotsuga Carriere, and Tsuga Carriere [2]. Representatives of the Pinaceae are an extremely important component of temperate, boreal, subalpine, and subtropical forests in the northern hemisphere. Moreover, conifers from the Pinaceae family are of great economic importance, especially the genus Pinus, which has been used for many years as a valuable source of wood, resins, essential oils, and seeds [3]. Many studies also show that members of this family represent a valuable source of active substances with medicinal properties [4][5][6][7][8].
The high economic and medical value of members of this family may be a strong temptation to over-exploit particular species, obtain them illegally, or even falsify products containing them. Since many species of the Pinaceae family are rare, endangered, or protected, this can pose a serious threat. Another equally serious risk is the misidentification of certain species. There are many closely related taxa in the Pinaceae family with a very similar needle or cone morphology. Moreover, they have the same ecological niches, similar
Table 1. Values of basic parameters characterizing genetic variation in ITS2 sequences in Pinaceae. AL-alignment length; CS-conserved sites; VS-variable sites; PIS-parsimony informative sites; SS-singleton sites; OMD-overall mean distance. Values in brackets are given as percentages.
The overall mean distance (OMD) for the Pinaceae is 0.342. In general, the highest values of the basic parameters characterizing the genetic variation of the ITS2 sequence in Pinaceae were found at the family level, and the lowest at the species complex level.

Genetic Divergence within and between Pinaceae Taxa
MEGA version X was used to calculate the genetic divergence of ITS2 sequences in the Pinaceae family. Table 2 summarizes in detail the values of five genetic divergence parameters, i.e., all interspecific distance, minimum interspecific distance, theta, all intraspecific distance, and coalescent depth, obtained in this study at the species complex, genus, subfamily, and family levels. At the P. mugo complex level, all genetic divergence indices were zero. At the genus level, the highest all interspecific and all intraspecific distances were observed in the genus Keteleeria (0.1998 ± 0.0973 and 0.1972 ± 0.1919, respectively). This genus was also characterized by the highest value of theta (0.1995 ± 0.0250) and coalescent depth (0.2958 ± 0.2874). The lowest values of the above-mentioned coefficients were observed for the genus Picea. Pseudotsuga was the only genus represented by a single taxon. Therefore, the value of all five coefficients for this genus was zero. At the subfamily level, the lowest all interspecific and all intraspecific distances were found for Laricoideae (0.0436 ± 0.0083 and 0.0063 ± 0.0011, respectively). In turn, the highest all interspecific distance was found in Pinoideae (0.1702 ± 0.0059), and the highest all intraspecific distance in Abietoideae (0.0370 ± 0.0291). For all analyzed subfamilies, the minimum interspecific distance was equal to 0.0000. The theta value ranged from 0.0222 for Laricoideae to 0.1492 for Pinoideae. The highest value of coalescent depth was observed for Abietoideae (0.0562), and the lowest for Laricoideae (0.0249). At the family level, the all interspecific and all intraspecific distances were 0.3586 and 0.0294, respectively. The theta value was equal to 0.3417 and the coalescent depth was 0.0423.
The pairwise genetic distance-based method was used to determine the genetic divergence of the analyzed ITS2 sequences. Figure 1 shows the relative distribution of all intraspecific and all interspecific distances obtained for the analyzed samples from the Pinaceae family. Intraspecific distances in the range from 0.0% to 1.0% constituted 46.04% of all observations, while interspecific distances in the ranges from 0.0% to 1.0% and from 1.0% to 2.0% constituted 14.74% and 14.39%, respectively. Overall, the vast majority of intraspecific distance values (95.06%) are in the range from 0% to 8%, with 83.93% in the range from 0.0% to 2.0%. Interspecific distances in the range from 0% to 14% constitute 90.49% of the calculations, and those in the range from 0% to 6% constitute 67.75% (Table S4).
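The relative distribution described above can be illustrated with a short sketch. It is not the TaxonDNA implementation; it only assumes that distances are given as proportions (e.g., 0.015 for 1.5%) and that bins are half-open intervals of 1% width. The function name `distance_distribution` and the toy values are hypothetical.

```python
from collections import Counter


def distance_distribution(distances, bin_width_percent=1.0):
    """Return {(lower, upper): share} with bin limits and shares in percent.

    Assumption: bins are half-open, i.e. [lower, upper).
    """
    if not distances:
        raise ValueError("at least one distance is required")
    counts = Counter()
    for d in distances:
        # Rounding guards against floating-point artifacts
        # (for example 0.29 * 100 evaluating to 28.999999999999996).
        index = int(round(d * 100 / bin_width_percent, 9))
        counts[index] += 1
    total = len(distances)
    return {
        (i * bin_width_percent, (i + 1) * bin_width_percent): 100 * n / total
        for i, n in sorted(counts.items())
    }


# Toy K2P distances (proportions), not data from this study.
toy = [0.0, 0.005, 0.015, 0.29]
for (lower, upper), share in distance_distribution(toy).items():
    print(f"{lower:.1f}%-{upper:.1f}%: {share:.2f}% of pairwise distances")
```

Running the same tally separately on intraspecific and interspecific distances gives the two distributions that are compared when looking for a barcode gap.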

Rates of Sequence Identification and Taxa Discrimination among Pinaceae
BLAST1 and TAXONDNA/Species Identifier 1.8 were used to determine the percentage of correct, ambiguous, and incorrect ITS2 sequence identification and taxa discrimination among Pinaceae.
Using the BLAST1 method, the correct ITS2 sequence identification rate was the highest for the genus Tsuga (100%) and the lowest for Pseudotsuga (0%). More than 40% of ITS2 sequences were correctly identified in the genus Abies and just over 30% in the genus Pinus. In contrast, in the genus Picea, only 8.33% of the ITS2 sequences were correctly identified. The ambiguous ITS2 sequence identification rate was the highest for Pseudotsuga (100%) and the lowest for Tsuga (0%). Generally, for six out of seven analyzed Pinaceae genera, the ambiguous identification rate was more than 50%. Incorrect identification rates were highest for Picea (25%), followed by Pinus (13.51%) and Abies (4.76%). Overall success in correct species discrimination based on the BLAST1 method and ITS2 sequences was moderate (32.88%). Figure 2 and Table S5 summarize the results obtained with the BLAST1 method.
In TaxonDNA, the "Best Match" (BM) and "Best Close Match" (BCM) options were used to verify the percentage of correct ITS2 sequence identification at the species complex, genus, subfamily, and family level for taxa represented in the analysis by at least two sequences. The success of taxa identification for the analyzed genera of the Pinaceae family using the TaxonDNA method is summarized in Table 3. At the species complex level, represented by the closely related European mountain pine complex, the Pinus mugo complex (PMC), 100% of the analyzed ITS2 sequences were ambiguously identified using the BM and BCM options. None of the ITS2 sequences in this species complex was identified correctly (0%) or incorrectly (0%). At the genus level, Pseudotsuga and Tsuga were the most successfully discriminated (100%), while Keteleeria had the lowest discriminatory success. At the subfamily level, the highest identification success was noted for Laricoideae (23.48%), and the lowest for Abietoideae (17.80%). At the family level, the success rate of species discrimination among the Pinaceae was relatively low.
Only 71 of 368 sequences were correctly identified to species level (19.2%) based on the best match (BM) analysis. It is worth emphasizing that ambiguous identifications (74.1%) were over three times more frequent than correct ones. Incorrect identification concerned 7.5% of the sequences. Similar results were obtained based on the best close match (BCM) analysis.

Phylogenetic Inference
Phylogenetic inference was performed in RAxML v8.2.11 [33] using 371 ITS2 sequences and the maximum likelihood (ML) method. The resulting phylogenetic tree sorted all analyzed Pinaceae taxa into the appropriate genera according to the commonly accepted taxonomy, most with high bootstrap support. Although the ITS2 sequences allowed the assignment of taxa to the different Pinaceae genera, their distinction within these genera was less clear. These observations are fully consistent with the results we obtained with BLAST1 and TaxonDNA in terms of correct sequence identification and discrimination of Pinaceae taxa, where the species level also turned out to be the most critical.
The overall topology of the obtained phylogenetic ML tree based on ITS2 sequences is consistent with the commonly accepted division of Pinaceae and strongly supports the concept of monophyly of particular Pinaceae genera (Figure S1).

Discussion
The main aim of the article was to assess the suitability of the ITS2 region for the discrimination and identification of taxa in Pinaceae and for phylogenetic inference in this large family of conifers. A detailed analysis of the genetic variation of the ITS2 sequence was performed for the first time in one comprehensive study. Particular attention was paid to precisely characterizing the level of genetic variation in the ITS2 sequence at four different taxonomic levels and to determining the success of sequence and taxa discrimination using both phylogenetically distant and closely related conifers.
In contrast to some reports in the literature [26,27], our observations clearly showed that although ITS2 is easy to analyze, it has severe limitations (mainly a low level of genetic variability) and is not the best choice for identification, authentication, or phylogenetic inference in Pinaceae. The finding was quite unexpected, as there are quite a few ITS2 sequences in the databases. Moreover, many studies show that ITS2 is a very good DNA barcode region [22,23,25,34,35]. According to the China Plant BOL Group [36], the nuclear ribosomal DNA region, ITS, is characterized by a much better ability to discriminate species compared to plastid regions. The internal transcribed spacer (ITS1, 5.8S, ITS2) is characterized by a higher rate of evolution than the coding regions, is inherited biparentally, and shows a high level of divergence at the species level, which allows it to be used to identify even closely related species [37]. This is one of the reasons why assessing the ability of ITS2 to discriminate between closely related taxa in the Pinaceae family was one of the goals of our work.
In seed plants, the length of ITS usually varies from 500 to 700 bp [38], while in gymnosperms it is much longer, especially in Pinaceae (1500-3700 bp) [37,39]. In addition, half the length of ITS1 consists of subrepeats [40][41][42][43], which causes some difficulty in PCR amplification and with typical Sanger sequencing reads. Hence, the availability of complete ITS sequences in genetic databases for different species is still very limited. Therefore, the China Plant BOL Group has suggested using only the ITS2 sequence as an alternative to the complete ITS region, due to the corresponding variability of the primary and secondary structure [23,27,44].
There are many reports in the literature assessing the identification potential of various genomic regions proposed as ideal DNA barcodes for distinguishing and analyzing different groups of organisms. For animals, cytochrome c oxidase subunit 1 has been proposed as the main barcode [16]. In the case of plant identification, this region does not work well due to the insufficient rate of nucleotide substitution in the plant mitochondrial genome [45]. To solve this problem, several promising highly variable plastid DNA regions have been proposed, including both coding and non-coding loci, which can be used singly or together in various combinations. In this way, among others, rbcL [45][46][47], matK [48][49][50][51][52], the combination rbcL + matK [53], the intergenic spacer region trnH-psbA [46,49,50,54,55,56,57], rpoB [45,49,51], rpoC1 [49,51], atpF-atpH [45], ndhJ, ycf5, and accD [51] were selected as valuable. Recently, the ycf1 region was indicated as very promising due to its very high variability, and even recommended as the main barcode in terrestrial plants [58]. Although the usefulness of ycf1 as a species-diagnostic marker has been demonstrated in the case of the Pinaceae family [59], it should be used with caution for three reasons: this region has been confirmed to be under positive selection in all Pinus plastomes [60], hybridization is frequent within the genus Pinus, and the chloroplast genome has a complex mode of inheritance [61]. Some difficulty in the widespread use of the complete ycf1 region may also arise from its considerable length, which makes it difficult to apply conventional Sanger sequencing with reads of 600-800 bp.
In this respect, nuclear markers, especially the short ITS2, seem to be a better solution than chloroplast regions, especially given that the ITS2 region has so far been used quite successfully for the discrimination and identification of taxa in many angiosperm families. ITS2 turned out to be the best marker differentiating species from the Araliaceae family [24]. Several universal and popular DNA barcode regions (matK, rbcL, ITS2, psbA-trnH, and ycf5) were assessed by Liu [24] for their ability to identify 1113 sequences derived from 276 species from 23 genera. ITS2 correctly identified 85.23% and 97.29% of the sequences at the species and genus levels, respectively. Additionally, it was suggested that the psbA-trnH region could be an additional candidate DNA barcode for the identification of the Araliaceae family. Similarly, the high efficiency of the ITS2 region in discriminating taxa (>90% and 100% at the species and genus levels, respectively) was demonstrated in the Euphorbiaceae family based on the analysis of 1183 samples representing 871 species and 66 genera [34]. Similar results were obtained in the study of the Rutaceae family [62], where ITS2 was shown to be superior to the other six barcodes (psbA-trnH, matK, ycf5, rpoC1, rbcL, ITS). It was characterized by the highest interspecific divergence relative to intraspecific divergence. Moreover, it has also been proven to be effective in distinguishing between closely related species. However, in our research, the effectiveness of the ITS2 sequence is zero at the P. mugo complex level. Our research clearly shows that the higher the taxonomic level, the higher the percentage of success in discriminating Pinaceae taxa. However, it seems that this percentage decreases not only with decreasing phylogenetic distance between the analyzed samples, but also with an increasing number of available sequences and individuals in the database.
ITS2 was also used with greater or lesser success in the analysis of the Apocynaceae [22] and Asteraceae [35] families. In the latter, the largest family of flowering plants, ITS2 was considered a suitable, but not ideal, barcode for identifying species of high medical importance. Compared to other barcodes, ITS2 was characterized by the greatest universality, specific divergence, and discrimination, which makes it a promising marker for Asteraceae authentication. In the Fabaceae family, ITS2 has been shown to be effective in identifying medicinal plants. It is worth noting that ITS2 also turned out to be an appropriate phylogenetic marker, but it did not solve all taxonomic problems [23]. This suggests that the effectiveness of DNA barcoding using ITS2 varies from family to family and that it can be unreliable in complex taxonomic groups. In the Brassicaceae and Rosaceae families, ITS2 proved to be an imperfect barcode due to its low resolution [63,64].
As generally accepted, the ideal DNA barcode region should have a higher interspecies than intraspecies variation. In the case of the current study based on the pairwise genetic distance-based method (PWG-distance) for 368 ITS2 sequences from the Pinaceae family, the estimated inter-specific divergence parameter is higher than the intra-specific distances for all analyzed groups except for the Pinus genus. The DNA sequence similarity-based method, on the other hand, showed that in the case of the Pinaceae family, the success of taxa discrimination using ITS2 was relatively low. This shows that internal transcribed spacer 2 has some problems in identifying both subspecies and varieties, so closely related species are not properly distinguished.
Our results are fully consistent with those obtained for other gymnosperm families, including Podocarpaceae, the second largest family among gymnosperms, in which ITS2 did not show a particularly high rate of taxa discrimination [20]. Moreover, Yao's [26] studies on several families of gymnosperms showed that ITS2 had the lowest discriminatory success at the species level in gymnosperms in comparison to other plant groups (mosses, ferns, monocotyledons, dicotyledons) as well as animals.

Sampling and Plant Materials
In our study, a total of 371 ITS2 sequences were analyzed, of which 368 sequences belonged to 71 Pinaceae taxa of 7 genera, and three Podocarpus sp. sequences served as outgroups for the phylogenetic analyses only (Tables S1 and S2). Of the 368 ITS2 Pinaceae sequences, 346 were downloaded from GenBank, and 22 were obtained in this study by analyzing 20 individuals belonging to the Pinus mugo complex and 2 individuals of Scots pine (Pinus sylvestris), the taxon most closely related to the Pinus mugo complex. The analyzed specimens were collected in Poland, the Czech Republic, and Germany (Table S3). The suitability of the ITS2 region for discriminating taxa of the Pinaceae family was assessed by analyzing both phylogenetically distant and closely related taxa.

DNA Extraction and Next-Generation Sequencing
A 100 mg sample of fresh plant tissue was used to extract genomic DNA using the Genomic Mini AX Plant Spin kit (A&A Biotechnology, Gdańsk, Poland). The quality and . The reads generated from next-generation sequencing were assembled and annotated using the Geneious Prime 2020.2.5 package. Reads were mapped to the MT735327 sequence from GenBank using the Geneious mapper with default settings and a minimum mapping quality of 30%. The mapped reads were subsequently assembled de novo using the Geneious algorithm with default settings and annotated based on the MT735327 sequence. The obtained 22 complete ITS2 sequences were submitted to GenBank. Each sequence was assigned an accession number (Table S2).

Data Validation
The sequences of internal transcribed spacer 2 were downloaded from GenBank using the query "internal transcribed spacer 2 Pinaceae" (in July 2021). The nuclear ribosomal ITS2 sequences with ambiguous bases were discarded from further analyses. In order to extract the ITS2 sequence, the ITS2 database (Internal Transcribed Spacer 2 Ribosomal DNA Database) (version 3.0.13) available online (http://its2.bioapps.biozentrum.uni-wuerzburg.de, accessed on 20 August 2021) was used [65][66][67]. Species represented by a single sequence were excluded from the analysis.

Data Analysis
In order to carry out a deeper and more precise genetic characterization, four taxonomic levels were considered, namely species complex, genus, subfamily, and family. ITS2 sequences were aligned using ClustalW with default parameters, as available in MEGA version X [68]. Then, the length of the alignment was estimated and the percentage of conserved sites (CS), variable sites (VS), parsimony-informative sites (PIS), and singleton sites (SS), as well as the overall mean distance (OMD), were calculated. To assess the suitability of the ITS2 sequence as a potential barcode at the level of genus, subfamily, and family of Pinaceae, selected methods were used, namely the pairwise genetic distance-based method (PWG-distance), the DNA sequence similarity-based method (TaxonDNA, BLAST), and the phylogenetic tree method (maximum likelihood).
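The site categories listed above can be illustrated with a minimal sketch. It is not the MEGA implementation; it assumes the usual definitions (a parsimony-informative site has at least two character states that each occur at least twice, a singleton site is variable but not parsimony-informative) and, as a simplifying assumption, it ignores gaps and ambiguous characters within a column and skips columns without any unambiguous base. The function name `classify_sites` and the toy alignment are hypothetical.

```python
from collections import Counter


def classify_sites(alignment):
    """Count AL, CS, VS, PIS and SS for a list of aligned sequences."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    counts = {"AL": length, "CS": 0, "VS": 0, "PIS": 0, "SS": 0}
    for column in zip(*(seq.upper() for seq in alignment)):
        # Assumption: only unambiguous bases are considered in each column.
        bases = Counter(base for base in column if base in "ACGT")
        if not bases:
            continue  # column contains only gaps or ambiguous characters
        if len(bases) == 1:
            counts["CS"] += 1
            continue
        counts["VS"] += 1
        states_seen_twice = sum(1 for n in bases.values() if n >= 2)
        if states_seen_twice >= 2:
            counts["PIS"] += 1
        else:
            counts["SS"] += 1
    return counts


# Toy alignment, not data from this study.
toy_alignment = ["ACGTA", "ACGTC", "ATGAC", "ATG-G"]
print(classify_sites(toy_alignment))
```

Dividing CS, VS, PIS, and SS by AL gives the percentages reported alongside the raw counts.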
The pairwise genetic distance-based method was used to determine the genetic divergence of the obtained ITS2 sequences. Five parameters were calculated in MEGA version X [68] using the Kimura two-parameter distance model (K2P) [69] with the pairwise deletion option to define the interspecific and intraspecific variability. The interspecific divergence was characterized by the all interspecific distance and minimum interspecific distance parameters [27,70,71]. Intraspecific variability was calculated from the K2P distance matrix using three parameters: all intraspecific distance, theta (θ), and coalescent depth [50,70]. Using the "Pairwise summary" option based on the K2P model available in the TaxonDNA/SpeciesIdentifier 1.8 software, the frequency distribution of interspecific and intraspecific distances was obtained. Plots were made to illustrate the barcode gap for each genus and for the entire Pinaceae family.
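As an illustration of the distance calculation, the K2P distance between two aligned sequences is d = -1/2 ln[(1 - 2P - Q) sqrt(1 - 2Q)], where P and Q are the proportions of transitional and transversional differences. The sketch below is not the MEGA implementation; it assumes that pairwise deletion means skipping every position where either sequence has a gap or an ambiguous base, and the helper names and toy records are hypothetical.

```python
import math
from itertools import combinations

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq1, seq2):
    """Kimura two-parameter distance with pairwise deletion."""
    transitions = transversions = compared = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # pairwise deletion of gaps and ambiguous bases
        compared += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    if compared == 0:
        raise ValueError("no comparable sites between the two sequences")
    if transitions == 0 and transversions == 0:
        return 0.0
    p = transitions / compared
    q = transversions / compared
    # math.log raises ValueError for saturated pairs (non-positive argument).
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


def split_distances(records):
    """Split all pairwise distances into intraspecific and interspecific lists.

    records: list of (species_name, aligned_sequence) tuples.
    """
    intra, inter = [], []
    for (sp1, s1), (sp2, s2) in combinations(records, 2):
        (intra if sp1 == sp2 else inter).append(k2p_distance(s1, s2))
    return intra, inter


# Toy records, not data from this study.
toy = [("sp_a", "AAAAAAAAAA"), ("sp_a", "AAAAAAAAAA"), ("sp_b", "GAAAAAAAAA")]
intra, inter = split_distances(toy)
print("mean intraspecific:", sum(intra) / len(intra))
print("minimum interspecific:", min(inter))
```

A barcode gap is present in such toy data when the minimum interspecific distance exceeds the maximum intraspecific distance.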
Sequence identification and taxa discrimination rates among Pinaceae were calculated using two different methods. The first was the DNA sequence similarity-based method implemented in the TAXONDNA/Species Identifier 1.8 program. The "Best Match" (BM) and "Best Close Match" (BCM) options were used to verify the percentage of correct ITS2 sequence identification at the species complex, genus, subfamily, and family level for taxa represented by at least two sequences [72]. The second was BLAST1. In this method, performed with the BLAST program (http://blast.ncbi.nlm.nih.gov/Blast.cgi, accessed on 15 February 2022), all ITS2 Pinaceae sequences were used as query sequences to search the reference database. Correct identification means that the best BLAST hits of the query sequence are from the expected species, while ambiguous identification means that the best BLAST hits for the query sequence are those of several species, including the expected species. Incorrect identification, in turn, means that the query sequence's best BLAST hit is not from the expected species.
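The three BLAST1 outcomes defined above can be written as a small decision rule. The sketch below is only an illustration of those definitions: it assumes that each hit is available as a (species, score) pair, that "best" means the highest score, and that self-hits of the query were removed beforehand. The function name `classify_identification` and the toy hits are hypothetical.

```python
def classify_identification(expected_species, hits):
    """Return 'correct', 'ambiguous' or 'incorrect' for one query sequence.

    hits: list of (species_name, score) tuples for the query.
    """
    if not hits:
        raise ValueError("at least one hit is required")
    best_score = max(score for _, score in hits)
    best_species = {species for species, score in hits if score == best_score}
    if best_species == {expected_species}:
        return "correct"    # all best hits come from the expected species
    if expected_species in best_species:
        return "ambiguous"  # best hits shared with other species
    return "incorrect"      # expected species absent from the best hits


# Toy hits, not data from this study.
toy_hits = [("Pinus mugo", 420.0), ("Pinus uncinata", 420.0), ("Pinus sylvestris", 401.0)]
print(classify_identification("Pinus mugo", toy_hits))
```

Counting the three labels over all query sequences of a genus and dividing by the number of queries gives rates of the kind reported for the BLAST1 method.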

Conclusions
Our study provides a critical assessment of the relevance of the nuclear DNA ITS2 region for taxa discrimination among Pinaceae. Based on the results obtained, it can be concluded that although ITS2 fulfills some of the important criteria of an ideal DNA barcode region, its usefulness for distinguishing taxa among Pinaceae is severely limited. It seems that correct and successful identification is possible only for taxa that are phylogenetically distant and represent different genera rather than species, and for those for which few sequences are available in genetic databases. The closely related conifers at the species complex level are indistinguishable using ITS2. The application of this region for studying relationships in the Pinaceae family does not seem justified due to its low genetic variability, which results in low phylogenetic resolution. Nevertheless, further research on ITS2 is needed, especially in terms of extending the available genetic databases with new records, particularly for species from the genera Tsuga and Pseudotsuga. A wider database would make it possible to determine whether the high success in distinguishing taxa in these two genera is due to the low number of samples deposited in these databases or to the high efficiency of ITS2. Another interesting direction for further research could be to determine the effectiveness of the complete internal transcribed spacer 1 (ITS1) in discriminating Pinaceae taxa and in studying the phylogeny of this large and important family of conifers.
Supplementary Materials: The following are available online at https://www.mdpi.com/article/10.3390/plants11081078/s1, Figure S1: Evolutionary relationships analyzed 371 ITS2 sequences, Table S1: Summary of the Pinaceae samples analyzed in this study, Table S2: List of taxa and records with accession numbers analyzed in this study, Table S3: List of individuals representing Pinus mugo complex taxa sequenced in this study, Table S4: Abundance of all intraspecific and all interspecific K2P pairwise distance in Pinaceae, Table S5: The results of species discrimination in seven genera of the Pinaceae family based on the BLAST1 method.
Author Contributions: J.S. and K.C. conceived of and designed the research framework; J.S. performed the experiments; J.S. and K.C. wrote the original draft manuscript, reviewed and edited the final manuscript; J.S. and H.F. performed data analysis; K.C. participated in the data analysis; J.S. and K.C. collected the samples; K.C. supervised the project. All authors have read and agreed to the published version of the manuscript.