DNA Barcoding Medicinal Plant Species from Indonesia

Over the past decade, plant DNA barcoding has emerged as a scientific breakthrough and is often used to help with species identification or as a taxonomical tool. DNA barcoding is very important in medicinal plant use, not only for identification purposes but also for the authentication of medicinal products. Here, a total of 61 Indonesian medicinal plant species from 30 families and a pair of ITS2, matK, rbcL, and trnL primers were used for a DNA barcoding study consisting of molecular and sequence analyses. This study aimed to analyze how the four identified DNA barcoding regions (ITS2, matK, rbcL, and trnL) aid identification and conservation and to investigate their effectiveness for DNA barcoding for the studied species. This study resulted in 212 DNA barcoding sequences and identified new ones for the studied medicinal plant species. Though there is no ideal or perfect region for DNA barcoding of the target species, we recommend matK as the main region for Indonesian medicinal plant identification, with ITS2 and rbcL as alternative or complementary regions. These findings will be useful for forensic studies that support the conservation of medicinal plants and their national and global use.


Introduction
Plant identification has formerly been formed using morphological characteristics that could be observed visually. Currently, DNA is also used to help species identification and to build bioinventories [1]. DNA barcoding was introduced by Hebert and colleagues in 2003 and involves the identification of species through universal, short, and standardized DNA regions [2]. DNA material for the barcoding can be obtained from living plants, herbarium specimens [3], and market products [4,5].
In plants, plastid DNA (rbcL, matK, trnL, and trnH-psbA regions) and nuclear DNA (ITS and ITS2 regions) are often used in DNA barcoding [6][7][8]. The rbcL and matK regions are recommended by the Consortium for the Barcode of Life (CBOL) as a standard two-locus barcode for global plant databases because of their species discrimination ability [8].
The process entails registering the DNA of identified species into a barcoding library and matching the DNA of unidentified species against the genetic data available in the library [6,9]. The library or the database can be accessed online for species identification and taxonomic clarification [10], namely through the NCBI GenBank (https://www.ncbi. nlm.nih.gov; accessed on 1 February 2022) [10] and the Barcode of Life Data (BOLD) (http://www.boldsystems.org; accessed on 1 February 2022) [11].
DNA barcoding has become an important taxonomic tool because of its accuracy, repeatability, and rapidity. It can also be used to identify species under legislative protection, or under threat of extinction, and to check the authenticity of biological products [6,9]. It is particularly powerful as identification is not influenced by the morphological diversity of species, growth phases, and environmental factors [12][13][14][15]. In the forensic field, even an inexperienced user is able to assign a taxonomic name to an unidentified plant specimen

Understanding the Use of DNA Barcoding for Indonesian Medicinal Plants
Of the 61 sampled Indonesian medicinal plants, 55 species are native to Indonesia (of which 29 are endemics), and six are introduced [34]. Some of the medicinal plants may need to be prioritized in terms of conservation, namely those assessed as threatened (VU, EN, CR) or near threatened (NT) according to the IUCN Red List [35], the 19 species listed in the CITES Appendices I, II, or III (UNEP-WCMC database) [36], and the 12 rare medicinal plants [37]. Two species were assessed as VU, namely Aquilaria hirta Ridl. [38] and Etlingera solaris (Blume) R.M.Sm. [39] and are considered to be facing a high extinction risk in the wild in the near future [40]. The 19 species listed in CITES II may become extinct if their trade is not controlled because they are collected from the wild and there is no sufficient data with respect to artificial propagation for commercial purposes [36]. Of the 61 sequence target species, 13 sequences were not found in BOLD, although their DNA sequence data was available in NCBI; a further 10 species did not have DNA sequences stored in either NCBI or BOLD. Detailed information for each of the 61 species is presented in Table 1. The contribution of the DNA barcoding information from each species to DNA banks and to the correct identification of medicinal plants with high conservation status was classified using categories A-M, where category A denotes the contribution of new data to DNA banks and DNA barcoding information that can strongly assist MP conservation; at the opposite end of the spectrum, letter M denotes the least substantial contribution, where DNA barcoding needs to be clarified further before using it directly for identification. Figure 1 indicates how the four DNA barcodes are useful for the conservation and use of Indonesian medicinal plants with respect to the availability of their data in the DNA bank. The number of medicinal plant species per criteria are provided in Table A1. Sequences grouped in categories A-D can be of direct use to conservation efforts due to the correct identification of related medicinal plants. The A-B categories can be used in botanic forensics (in cases of medicinal plant adulteration and illegal trading) for medicinal plant identification [10,[21][22][23][24], as the plants are listed as species that need to be prioritized in terms of conservation.There are 19 families of Indonesian medicinal plants consisting of 31 species, that could be identified accurately to the family level by DNA barcoding. Two major families of Indonesian medicinal plants that were successfully sequenced and correctly identified were Orchidaceae (13 sequences) and Apocynaceae (10 sequences). It is highlighted that correct identification was defined after the validation step, which is cross-checked to morphological identification result by taxonomists (indicated in the species identity card).

Understanding The Effectiveness of Each DNA Barcoding Region Used for Indonesian Medicinal Plants Identification
A total of 61 studied species were analyzed for DNA barcoding of four regions (ITS2, matK, rbcL, and trnL). There were some failures in DNA amplification and sequencing assembly, with the results of each step presented in Table 2. The sequence quality is based on the easily done assembly of both the forward and reverse regions into a single consensus sequence (Table 2). When both forward and reverse sequences were available and were of good quality, obtaining the assembled consensus sequence was straightforward. If one direction of the sequence was mixed, then no assembly could occur and only the unidirectional sequence could be used. The matK region, which is known to have the lowest amplification success among the regions used for DNA barcoding [3,41], could only be amplified in 72% samples, compared with successful amplification in 83-98% samples for the other regions (Table 2). This is consistent with previous work indicating matK has a lower PCR success rate than rbcL for DNA amplification of Indonesian plants [42]. The PCR amplification failure likely occurred due to a high level of sequence variation within the matK regions complementary to the primers [43].
There were only 212 sequences of ITS2, matK, rbcL, and trnL obtained from 61 Indonesian medicinal plants instead of the expected 244 sequences resulting from the sequencing (Table A2). Each species was annotated with its key information, such as whether it is native, how the species became important to be conserved, and all generated sequences from ITS2, matK, rbcL, and trnL regions with identification results from BLAST, whether correct, ambiguous, correct at genus or family level, or incorrect.

Description of ITS2, matK, rbcL, and trnL Regions of Indonesian Medicinal Plants
The descriptive statistics of the sequence regions ITS2, matK, rbcL, and trnL are portrayed in Figure 2. The minimum and maximum lengths (bp) of ITS2, matK, rbcL, and trnL regions varied between 233-984, 384-1142, 382-1122, and 416-962, respectively, for all studied species; the average lengths of each region were 591. 2  The relationships between MP species identification accuracy and sequence length (bp), GC content (%), species number per genus, and percentage of identity are presented in Figure 3. With respect to sequence length, the longer the ITS2 and rbcL sequence regions, the lower the identification accuracy, while the other regions indicated no such relationship. With respect to GC content (%), all regions except ITS2 tended to be less accurate for identification when the GC content increased. In terms of species number per genus, matK, rbcL, and trnL regions all tended to have no correlation with the species number per genus, but the ITS2 sequence region was more accurate in identification when the species number per genus was higher. However, this result will depend on the available DNA information in the data bank. All regions indicated a positive relationship of percentage identity (through a BLASTN search) with identification accuracy. Among the sequence regions produced for Indonesian medicinal plants, ITS2 generally had the lowest minimum length, smallest average sequence, and highest GC content (Figures 1 and 2) and hence gives the highest efficiency of identification, with only a short DNA sequence needed for correct identification. After ITS2, matK follows second with respect to having the smallest average sequence length. A short DNA sequence may make the process of DNA barcoding technically easier and more economical from extraction to sequencing, as Kress and colleagues suggested [44]. Meanwhile, in terms of GC content (%), only ITS2 had higher identification accuracy when the GC content increased. In some plant DNA sequences, GC content has a positive correlation with exon sites, i.e., the coding regions [45]. This might mean that the longer the exons, the higher the GC content; thus, DNA regions with high GC content are expected to have more accurate identification.

Identification of Indonesian Medicinal Plants Using Sequences of Their ITS2, matK, rbcL, and trnL Regions
Identification of the sequence regions resulting from the BLAST method that have been matched with samples morphologically identified are presented in Table 3. The highest correct identification in the set of medicinal plant species was reached by the matK region, followed by ITS2 and rbcL, although the percentage values among them were not significantly different (i.e., 31.15% compared to 29.51%). In contrast, trnL had the lowest correct identification, approximately 15% lower than that of matK. The highest incorrect identification was reached by the ITS2 region, followed by the rbcL region. Overall, the most accurate of the four regions was matK because it has the highest identification rate at the species level, lowest at the family level, and resulted in no incorrect identifications [3,41,42]. Some ambiguous (correct at the genus and family level) and incorrect identification of Indonesian medicinal plants occurred. This might have happened because the world plant data has more than 1.2 million species names [34], while the DNA barcoding data for plants contains only 234,692 barcodes and only 5942 plants are recorded from Indonesia (http://www.boldsystems.org; accessed on 6 February 2020). As such, the available DNA bank to be cross-checked with the samples is far from complete.
The correct identification of unique species by singular regions and by combinations of regions can be visualized in the Venn diagrams ( Figure 4). ITS2 was the most accurate region with unique correct identification, followed by rbcL, matK, and trnL. A combination of three regions gave the same number of unique correct identifications, and a combination of all gave the highest correct identification. With respect to unique correct identification at the genus level, rbcL gave the most accurate identification, followed by ITS2, trnL, and finally matK. A combination of matK, rbcL, and trnL gave the best unique accurate identification compared to the other three combinations, and the combination of all gave the largest number of unique species among all possibilities. The highest unique correct species at the family level were obtained by using rbcL, then ITS2, and finally trnL. As presented in Table 4, the overall averages of the barcoding regions describing the genetic distance between the two compared species were very similar to one another, i.e., above 1.1% and below 1.2%, except for ITS2, which indicated an average of 1.29%. The lower the taxon unit relation, the lower the percentage, while the higher the taxon unit relation, the higher the percentage. Only the minimum distance of the matK region could describe species in the same genera. Nevertheless, the maximum distance of each region describes the highest level of the different species in a family. In principle, the genetic distance of interspecific related species (within the genus level and above) should be greater than that of the intraspecific species (within species level). It can be stated that more genetic distance lies between two different species with a different family than two different species with the same family. Species within the same genus have the least genetic distance. The percentage of the sequences identified for each of the regions (ITS2, matK, rbcL, and trnL) was directly proportional to the accuracy of the identification. The higher the percentage, the more accurate the identification. MatK could correctly lead to identification of species with the highest percentages, followed by rbcL and ITS2 (Table 2). Only the matK region could differentiate species at the same genus level and species in different families compared to other regions. In contrast, ITS2 could not differentiate all species distances appropriately (Table 4).
In addition, it should be considered that using BLAST in a DNA barcoding study with any regions/primers is a basic step in identifying species [25][26][27][28]42]. BLAST analysis is the approach to the most similar species, and it depends on the species information stored in DNA bank. Therefore, the validation step to confirm the most accurate or most possible species is still required. When the used samples were clear species [42] like in this study, morphological identification by the experts was used, but when the samples were unable to be identified morphologically due to damage or derivate form or/and lack of taxonomic expert, generating a phylogenetic tree amongst medicinal plant groups such as a neighbor-joining (NJ) tree [23,25,26,42], maximum parsimony (MP), and maximum likelihood (ML) [42], and even analyzing chemical compound products [24] might be needed.
Considering Hollingsworth and colleagues' findings with respect to DNA barcoding, it could serve two purposes. The first would be to provide information into the species-level taxon unit, and the second would be to help identify an unknown specimen to a known species. Thus, all the regions tested are valuable, depending on the purpose [43]. We emphasize that having a phylogenetic tree in the barcoding study would be beneficial, particularly when experts assume the medicinal plants are unidentified or a cryptic species. Thus, identification, authentication, and even conservation plan and action can be properly defined and applied.

Plant Samples and Literature Survey
This study used 61 different species of medicinal plants belonging to 30 families and 50 genera (Table 1). Plant samples were collected from a living collection with written permission from botanic gardens, including Bogor Botanic Gardens and Cibodas Botanic Gardens in Indonesia, and Hortus Botanicus Leiden in the Netherlands. All species had been taxonomically identified using morphological features as viewed on their identity card. Their scientific names were cross-checked online using POWO (2022) [34]. A leaf sample was collected from each species, except for Alstonia scholaris (L.) R. Br. and Spondias malayana Kosterm, from which bark samples were taken. This was due to A. scholaris and S. malayana Kosterm being high trees with unreachable leaves. Each sample (approximately 25 g) was collected and stored in a teabag with silica gel [46][47][48].
A literature study was conducted to collect all scientific information with respect to each of the sampled plant species, which can help the conservation status of every species. Information about available DNA data-i.e., whether the species already had DNA barcoding or genetic information that could be accessed from DNA banks-was identified using BOLD [11] and NCBI [10]. Data on species origin, including whether the species are native or introduced to Indonesia, and, if native, whether they are endemic, were collected from POWO (http://www.plantsoftheworldonline.org; accessed on 1 February 2022) [34]. Threatened species status was collected from the IUCN Red List of Threatened Species (https://www.iucnredlist.org; accessed on 6 February 2022), with species classified as Vulnerable (VU), Endangered (EN), Critically Endangered (CR), Extinct in The Wild (EW), or Extinct (EX) [35]. Global legislation regulating trade, i.e., based on whether the species is included in CITES Appendices I, II, or III, was collected from the UNEP-WCMC Checklist of CITES species (https://checklist.cites.org; accessed on 1 February 2022) [36]. The information on rare medicinal plants, was compiled from the Indonesian Biodiversity Strategy and Action Plan (IBSAP) National Document [37]. Endemic plants or plants mentioned in the IUCN Red List, CITES Appendices I, II, or III, endemic, and priority lists were considered to be important species that need to be prioritized for conservation [49].

Molecular Analysis
Molecular analysis was performed at the University of Guelph laboratory, Canada. The barcoding method involves genomic DNA extraction, DNA amplification, and DNA sequencing, and taxonomic identification against available DNA banks. For DNA extraction, genomic DNA was extracted from plant samples using the Maxwell ® RSC Purefood GMO and Authentication Kit and the Maxwell ® RSC Instrument (Promega). For DNA amplification, primers targeting the ITS2, matK, rbcL, and trnL genes of plants were used to amplify the DNA (Table 5). Each PCR reaction mix (25 µL) contained 1x HotStarTaq master mix (Qiagen), 0.4 µM of each (forward and reverse) primers, 0.15 µg of BSA and 2 µL of template DNA. PCR thermal cycling was conducted by using a GeneAmpTM PCR System 9700 (Applied Biosystems, Waltham, MA, USA). The PCR cycling conditions were as follows: 95 • C for 10 min for DNA denaturation, 45 cycles of 95 • C for 15 sec for DNA annealing with the primer, followed by 55 • C for 30 sec and 72 • C for 1 min for DNA extension, and finally 72 • C for 7 min.
PCR products were visualized on 2% agarose gels to check whether DNA amplification was successful. PCR products were then purified using a NucleoFast ® 96 PCR clean-up kit (Macherey-Nagel). The purified PCR fragments were sequenced bidirectionally, using the same primers as for the PCR, with the help of an ABI 3730 Genetic Analyzer (Applied Biosystems). The retrieved sequences were analyzed using ABI Prism TM Sequencing Analysis software (Applied Biosystems) to obtain a consensus sequence (Q > 20) for each sample.

Sequence Analyses and Data Interpretation
For each sample, the consensus sequence was compared with the nucleotide sequences in the BOLD species ID engine and the NCBI GenBank using BLASTN (https://blast.ncbi. nlm.nih.gov; accessed on 7 January 2022) [52] with the program selection as "Highly Similar Sequences (Megablast)" [53] for taxonomic identification. When no result was obtained from Megablast due to the sequence being too short, the sequence was queried with the program selection as, "Somewhat similar sequences (nBlast) for an alternative".
PCR amplification, sequencing, and identification success rates were calculated as percentages. Only one best-matched species was selected from the BLASTN identification that is approached from the most similar sequence species recorded in DNA bank. Where there was more than a single match, the best-matched species was selected as the one with the lowest E value and the highest coverage; otherwise, any species was the closest-related species to the query (species). The results were then validated with studied medicinal species' ID from botanical gardens where they have been morphologically identified by taxonomic expert.
The BLAST identification results were the initial step to identify species with DNA barcoding [25][26][27][28]42]. It was considered to be the correct species if the highest percentage of identification referred to the right species, i.e., when the species name from sequence identification matched the morphologically identified species. Otherwise, when the sequence was identified as a different species within a genus or a different species within a family, the result was considered to be an ambiguous species or genus. Ambiguous identifications were counted as correct identification, as per the study by Amandita et al. [42]. Sequences with an identification percentage of 99% or more were included in the novel sequence data for specific DNA barcoding for a species. Novel sequence data will be deposited in the GenBank database to assist in future identification.
Descriptive, statistical, and scatter plot analyses were used to gain understanding of the ITS2, matK, rbcL, and trnL regions and the relationship between factors in the BLAST analysis, with the identification being completed using the MINITAB Statistical Software.
In addition, Venn diagrams generated by Bioinformatics and Evolutionary Genomics (http://bioinformatics.psb.ugent.be/cgi-bin/liste/Venn/calculate_venn.htpl; accessed on 2 January 2022) were used to depict how many species were correctly identified by singular regions and by multiple combinations of regions, whether or now there was a correct identification within species, genus, or family level. Information about the species number per genus was obtained from POWO [34].
Sequence alignments were performed using the Muscle program. The nucleotide composition of all sequences obtained from the ITS2, matK, rbcL, and trnL regions were computed, and their genetic distances were calculated with Kimura 2 parameters (K2P) [54]. The K2P pairwise genetic distance is the percentage of nucleotide sequence divergence that was used by Hebert and colleagues [2]. All analyses were performed with the Molecular Evolutionary Genetics Analysis (MEGA X) software [55].
All the medicinal plant species information collected was analyzed and interpreted according to the use of the data in DNA barcoding with respect to conservation. Any correct identification can be used for DNA barcoding for related species and can be subsequently helpful for medicinal plant conservation, although the DNA barcoding can only be used for identification at species level and cannot estimate variation within species [56]. Any ambiguous identification can be used as an approach to species identification and thus may also be valuable for medicinal plant conservation.
Any new sequence or new DNA barcoding that is not available in NCBI or BOLD constitutes novel data. Species included in at least one of the following categories: IUCN Red List [40], CITES Appendixes I, II, or III [36], rare medicinal plants species [37], or Native and Endemic species [34] would require DNA barcoding more urgently than the non-listed species. Therefore the species were categorized in priority order A-M as follows: new DNA barcoding and can strongly assist medicinal plant (MP) conservation (A), can strongly assist MP conservation (B), new DNA barcoding and can assist MP conservation (C), can assist MP conservation (D), new to DNA bank data and new DNA barcoding and may strongly assist MP conservation (E), new DNA barcoding and may strongly assist MP conservation (F), may strongly assist MP conservation (G), new to DNA bank data and new DNA barcoding and may assist MP conservation (H), new DNA barcoding and may assist MP conservation (I), may assist MP conservation (J), new to DNA bank data and new DNA barcoding but sequences need to be clarified further (K), new DNA barcoding but sequences need to be clarified further (L) and sequences need to be clarified further (M).

Conclusions
Based on the results of this study, we conclude that no single region is perfectly ideal for DNA barcoding. Nonetheless, according to the observed criteria, we recommend matK as the core DNA barcoding region for Indonesian medicinal plant identification. In addition, due to its unique correct species identification, we recommended the ITS2 and rbcL regions as alternative or complementary regions to the core barcoding DNA using matK. DNA barcoding for 33 Indonesian medicinal plant species was provided; of these 33 species, 21 species were newly DNA barcoded; of these 21 species, three contributed novel DNA barcoding data to DNA bank. In the future, this guide and associated data will facilitate a means to identify Indonesian medicinal plants, particularly those that need to be conserved strongly, to assure a valid species rather than a substitute in herbal medicines and to prevent illegal trade.