A Systematic Study on DNA Barcoding of Medicinally Important Genus Epimedium L. (Berberidaceae)

Genus Epimedium consists of approximately 50 species in China, and more than half of them possess medicinal properties. The high similarity of species’ morphological characteristics complicates the identification accuracy, leading to potential risks in herbal efficacy and medical safety. In this study, we tested the applicability of four single loci, namely, rbcL, psbA-trnH, internal transcribed spacer (ITS), and ITS2, and their combinations as DNA barcodes to identify 37 Epimedium species on the basis of the analyses, including the success rates of PCR amplifications and sequencing, specific genetic divergence, distance-based method, and character-based method. Among them, character-based method showed the best applicability for identifying Epimedium species. As for the DNA barcodes, psbA-trnH showed the best performance among the four single loci with nine species being correctly differentiated. Moreover, psbA-trnH + ITS and psbA-trnH + ITS + rbcL exhibited the highest identification ability among all the multilocus combinations, and 17 species, of which 12 are medicinally used, could be efficiently discriminated. The DNA barcode data set developed in our study contributes valuable information to Chinese resources of Epimedium. It provides a new means for discrimination of the species within this medicinally important genus, thus guaranteeing correct and safe usage of Herba Epimedii.


Introduction
Over 200 million people suffer from osteoporosis around the world, and the prevalence of osteoporosis keeps rising with the increasing elderly population [1]. In 2005, the total fractures were more than 2 million, costing nearly $17 billion in the United States. The rapid growth in the disease burden was also projected from 2005 to 2025 [2]. For a long time, Chinese herbal medicine has been used to treat fractures and joint diseases in China. Herba Epimedii (Yinyanghuo) is one of the most widely used herbs that are prescribed in formulas to treat osteoporosis [3], and its extract can reduce the occurrence of osteoporosis both in experimental models and clinical studies [4]. The term Yinyanghuo was first listed in Shen Nong Ben Cao Jing as a middle-grade herb during 200-300 B.C., and many species in genus Epimedium L. (Berberidaceae) have been used in traditional Chinese medicine (TCM) for a long time to nourish the kidney and reinforce the Yang [5,6]. At present, five species are recorded we systematically evaluated the feasibility and identification efficiency of four generally acknowledged single loci, namely, rbcL, psbA-trnH, ITS, and ITS2, and their combinations to discriminate the Epimedium species distributed in China to identify the most suitable barcodes for genus Epimedium and provide effective information for the species identification of this genus.

Plant Materials
A total of 72 samples representing 37 Epimedium species, of which 20 are medicinally used, were collected from the Jiangxi, Guizhou, Hubei, and Jilin provinces in China. In addition to the widely distributed Epimedium species, some narrowly distributed species of the genus in China were collected as well. For instance, E. dewuense and E. shuichengense are endemic species to certain areas of Guizhou province. E. zhushanense only grows in Zhushan County in the Hubei province, and E. tianmenshanensis is merely distributed in Hunan province. Moreover

DNA Isolation, Amplification, and Sequencing
Approximately 30 mg fresh leaf from each sample was ground for 2 min at a frequency of 30 times/s in a FastPrep bead mill (Retsch MM400, Haan, Germany). The total genomic DNA was extracted using the Universal Plant Genome DNA Kit (Tiangen Biotech, Beijing, China) according to the manufacturer's instructions. Short fragments of the specific regions of the plastid (rbcL), noncoding (psbA-trnH), and nuclear DNA (ITS, including 18S, ITS1, 5.8 S, ITS2, and 28 S) sequences were amplified from the extracts. The ITS2 sequence was retrieved from the ITS region directly. Universal primers for candidate barcodes and reaction conditions were used as previously reported [35][36][37]. The primers and reaction conditions used to amplify these regions are listed in Table S2. PCR was performed in 25 µL of the reaction system, containing 2 µL of template DNA (approximately 30 ng), 12.5 µL of 2× Taq PCR Master Mix (Aidlab Biotechnologies Co., Ltd., Beijing, China), and 1 µL of each primer (2.5 µmol/L), and filled with double-distilled water. The purified PCR products were sequenced in both directions by using an ABI 3730XL sequencer (Applied Biosystems, Inc., Foster City, CA, USA) on the basis of the Sanger sequencing method by the Major Engineering laboratory of the Chinese Academy of Agricultural Sciences (Beijing, China). Altogether, 70 samples representing 36 species were sequenced successfully with all the three loci (rbcL, psbA-trnH, and ITS), and 210 new sequences were submitted to GenBank under the accession number MG837275-MG837475 and MH252065-MH252073 (Table S1). These sequences were used in the subsequent analysis.

Statistical Analysis
Consensus sequences and contig generation were performed using the CodonCode Aligner V 7.1.1 (CodonCode Co., Dedham, MA, USA). The ITS sequences were subjected to hidden Markov model analysis to obtain the complete ITS2 sequences [38]. Sequences from each DNA region were aligned using MEGA 6.0 [39]. For species with more than three individuals, the average intraspecific distance, theta, and coalescent depth were calculated to evaluate the intraspecific variation based on the Kimura two-parameter (K2P) model; the average interspecific distance, minimum interspecific distance, and theta primer were used to assess the interspecific divergence using the K2P model [40][41][42]. To assess the potential of each single locus and their combinations for accurate species discrimination, distance-based methods (TaxonDNA and neighbor-joining trees) and character-based approach (BLOG) were employed for species with more than one individual. The best match (BM) and best close match (BCM) tests in the Species Identifier 1.8 program of TaxonDNA were run [43]. The neighbor-joining (NJ) trees of each single locus and their combinations were constructed using MEGA 6.0, and bootstrap tests were performed with 1000 replicates [39]. The character-based DNA barcode method in BLOG (Barcoding with LOGic) 2.0 was applied to classify species with more than one individual [44]. Each single locus and their combinations were subjected to 100% slicing within species level with a maximum of 500 iterations (GRASPITER = 500) and a maximum time of 5 min for analysis (GRASPSECS = 300). The logic formula with the lowest false positive rate against the reference dataset was taken as discrimination basis. Moreover, species that could be identified were shown using Venn diagrams.

Efficiency of PCR Amplification and Sequencing
The three loci, rbcL, ITS, and psbA-trnH, all showed high PCR amplification and sequencing efficiency (100%). Moreover, the effective sequence ratio of rbcL and psbA-trnH were the same (100%), followed by ITS (97.2%). Detailed information regarding the PCR amplification and sequencing efficiency of the candidate barcodes is provided in Table S1.

Genetic Divergence Determination
The length of the aligned sequence (base pairs) and variable sites for the rbcL, psbA-trnH, ITS, and ITS2 regions were 703/10, 572/59, 705/45, and 247/23, respectively. Six parameters were used to characterize inter-and intraspecific divergence. Results indicated that psbA-trnH had the highest interspecific divergence, followed by ITS2 and ITS. Meanwhile, rbcL exhibited a relatively lower interspecific divergence compared with the other regions (Table S3). At the intraspecific level, rbcL showed the lowest divergence, while psbA-trnH displayed the highest variation level (Table S3).

Evaluation of the Identification Efficiency of the DNA Barcodes
In the present study, the distance-based method, namely, TaxonDNA, was used to assess the applicability of the different regions for species discrimination. Similar identification efficiency was obtained based on the BM and BCM methods (Table 1). For the single locus, psbA-trnH exhibited the highest species identification efficiency (29.62%), followed by the ITS (22.22%) and ITS2 regions (18.51%). Meanwhile, the rbcL region showed the lowest resolution rate of 3.70%. In the two loci combinations, psbA-trnH + ITS2 showed a higher resolution rate (38.88%) than psbA-trnH + ITS (37.03%). Moreover, compared with the two loci combinations, the resolution rate was not increased when three loci (psbA-trnH + ITS + rbcL and psbA-trnH + ITS2 + rbcL) were combined (37.03%). Results showed that only six species (E. koreanum, E. dewuense, E. zhushanense, E. shuichengense, E. brevicornu, and E. pseudowushanense) could be identified efficiently using single locus and their combinations ( Figure 1).
Second only to ITS region, psbA-trnH exhibited comparatively higher identification efficiency among the four single loci by using the NJ tree method (Figure 1). Three species, namely, E. koreanum, E. zhushanense, and E. shuichengense, could be identified using the locus psbA-trnH. Moreover, results suggested that the number of species that can be authenticated was increased when two or three loci were combined (Figure 1). Six species, namely, E. koreanum, E. dewuense, E. zhushanense, E. shuichengense, E. davidii, and E. brevicornu, could be discriminated by psbA-trnH + ITS and psbA-trnH + ITS + rbcL. Additionally, five Epimedium species, namely, E. koreanum, E. dewuense, E. zhushanense, E. shuichengense, and E. davidii, could be differentiated by psbA-trnH + ITS2 and psbA-trnH + ITS2 + rbcL. Detailed information regarding the NJ trees of each single locus and their combinations is provided in Figures S1  and S2.
Compared with the distance-based method, the identification efficiency of each single locus and their combinations was significantly improved by using the character-based approach (Table 1). Among the four single loci, psbA-trnH displayed the highest resolution rate (61.11%) with nine species being identified. By contrast, rbcL showed the lowest (14.81%). Moreover, combination of barcodes increased the species identification efficiency significantly. Among the four multilocus combinations, psbA-trnH + ITS and psbA-trnH + ITS + rbcL both showed the highest resolution rate (92.59%). Aside from E. acuminatum, E. leptorrhizum, and E. epsteinii, a total of 17 species, of which 12 are medicinally used, could be distinguished (Figure 1). The logic formulas to identify the 17 species by using psbA-trnH + ITS region are listed in Table 2. Among the four single loci, psbA-trnH displayed the highest resolution rate (61.11%) with nine species being identified. By contrast, rbcL showed the lowest (14.81%). Moreover, combination of barcodes increased the species identification efficiency significantly. Among the four multilocus combinations, psbA-trnH + ITS and psbA-trnH + ITS + rbcL both showed the highest resolution rate (92.59%). Aside from E. acuminatum, E. leptorrhizum, and E. epsteinii, a total of 17 species, of which 12 are medicinally used, could be distinguished (Figure 1). The logic formulas to identify the 17 species by using psbA-trnH + ITS region are listed in Table 2.  Species that can be identified based on three methods for each single locus and their combinations by using Venn diagrams. A, ITS; B, ITS2; C, rbcL; D, psbA-trnH; E, psbA-trnH + ITS2/psbA-trnH + ITS2 + rbcL; F, psbA-trnH + ITS/psbA-trnH + ITS + rbcL. "*" represents medicinally used species.  Table 2. Diagnostic formulas of psbA-trnH + ITS region generated by BLOG for species identification. "*" represents medicinally used species.

Development of DNA Barcode Resources for Genus Epimedium
In our research, four universal candidate barcodes, namely, rbcL, psbA-trnH, ITS, and ITS2, were assessed for their applicability in identifying the Epimedium species. All three regions (rbcL, psbA-trnH, and ITS) showed high rates of sequencing recovery. The psbA-trnH intergenic spacer was reported as one of the most variable plastid regions in the angiosperms [35]. In our study, it demonstrated the highest interspecific divergence between species of Epimedium. Meanwhile, it exhibited better identification capability than the other regions for Epimedium species based on both distance (TaxonDNA and NJ tree) and character (BLOG) analyses. The BLOG method displayed the highest identification ability with nine species being differentiated, whereas five and three species can be respectively discriminated by TaxonDNA and NJ tree. Thus, psbA-trnH region may be used as a potential barcode to discriminate the Epimedium species. Second only to psbA-trnH, ITS region showed good identification ability with six species being differentiated. The China Plant BOL Group suggested that ITS/ITS2 should be incorporated into the core barcode for seed plants [45]. Here, ITS2 showed lower identification efficiency than ITS based on both distance-and character-based methods, which could be attributed to lower interspecific divergence of ITS2 for Epimedium species.
Considering the limited discriminatory power of the single-locus barcode, a multigene tiered approach for barcoding plants was recommended by Newmaster et al. [46]. In this study, the combination of two and three loci showed better identification efficiency than that of the single locus. The combination of two loci psbA-trnH and ITS, as well as the combination of three loci psbA-trnH, ITS, and rbcL, showed the highest discrimination power among all the multilocus combinations. As for the analytical methods, the character-based method exhibited the highest identification efficiency with 17 species being discriminated. The character-based method (BLOG), suggested to be efficient and precise, could provide diagnostic formulas listing the species-specific nucleotides to differentiate species from others [47]. On the contrary, the distance-based method showed lower resolution rate than the character-based method, and only six species could be discriminated. The intraspecific and interspecific genetic distances overlapped in our study (Table S3), making authentication of these species difficult by using BM test, BCM test, and NJ trees. Therefore, the character-based method is a more appropriate choice to identify Epimedium species.
This study contributes with DNA barcodes for more than 70% of the species in Epimedium, covering nearly all the commonly used medicinal plant species and rare species. For each species, information of the four universal barcodes, namely, rbcL, psbA-trnH, ITS, and ITS2, were provided. The dataset developed could provide assistance for accurate species identification, sustainable recourse utilization, and new medicinal resources development of this genus.

Chloroplast Genome-Based Super Barcode Has the Potential to Solve the Problem of Species Discrimination and Classification in Genus Epimedium
A number of taxonomic controversies exist in the Chinese sect. Diphyllon [18,19,48], and efficient methods for species identification and classification are lacking. In our study, the commonly used DNA barcodes were effective in identifying 17 Epimedium species, but the phylogenetic relationships of Chinese sect. Diphyllon were poorly resolved. The unresolved phylogeny of genus Epimedium might be ascribed to the following taxonomic issues. First, some species published earlier were only based on one or two specimens or even fragmentary specimens, resulting in the inaccurate description of the morphological characteristics. Epimedium baojingensis and E. zhushanense were believed to be the only species with unifoliolate leaves when they were published. However, the investigation of Zhang et al. [48] revealed that leaves of these species are usually trifoliolate and occasionally unifoliolate. Second, using the small variations as the reference to publish new species tends to ignore the possible connection between the new species and model specimens, which may lead to unnecessary publication to some extent and even cause problems for further study. Third, incomplete breeding isolation between the species and existence of natural hybrids may also complicate the relationships of the Chinese Epimedium species, thereby increasing the difficulty of morphological identification and taxonomy research [12]. Thus, a more effective method is needed to further research the discrimination and classification of this genus. Recently, the chloroplast (cp) genomes of plants have been applied as a useful tool for phylogenetic studies and species identification [49][50][51]. The cp genome can significantly increase the resolution at low taxonomic levels in plant phylogeny; it was proposed as a species-level DNA barcode [52] and has been used as a plant barcode to discriminate closely related species [52][53][54]. The complete cp genome with more variation information than single or multiple DNA barcodes was also suggested as a super barcode [55], which may be a solution for the identification and classification problems in the Epimedium species. Zhang et al. [13] sequenced the complete cp genomes of five Epimedium species and found that the phylogenetic relationships among these cp genomes were consistent with the updated system. However, the cp genome study of 90% of the approximately 50 Epimedium species distributed in China still has not been conducted. Additionally, only one sample was sequenced for each species, which was not adequate to demonstrate the evolutionary relationships and divisions within the section Diphyllon because some Epimedium species have large intraspecific variations. Therefore, more systematic and in-depth classification and identification investigation on the basis of the complete cp genomes of a large sample size are urgently needed. Thus, the cp genome-based super barcode will be utilized to study the identification and evolution relationships of all Chinese Epimedium species in our further research. Further research is expected to solve the complex problems in taxonomy and species discrimination and to guarantee the medical safety of species in genus Epimedium.