Comparative Genomics and Phylogenetic Analysis of the Chloroplast Genomes in Three Medicinal Salvia Species for Bioexploration

To systematically determine their phylogenetic relationships and develop molecular markers for species discrimination of Salvia bowleyana, S. splendens, and S. officinalis, we sequenced their chloroplast genomes using the Illumina Hiseq 2500 platform. The chloroplast genomes length of S. bowleyana, S. splendens, and S. officinalis were 151,387 bp, 150,604 bp, and 151,163 bp, respectively. The six genes ndhB, rpl2, rpl23, rps7, rps12, and ycf2 were present in the IR regions. The chloroplast genomes of S. bowleyana, S. splendens, and S. officinalis contain 29 tandem repeats; 35, 29, 24 simple-sequence repeats, and 47, 49, 40 interspersed repeats, respectively. The three specific intergenic sequences (IGS) of rps16-trnQ-UUG, trnL-UAA-trnF-GAA, and trnM-CAU-atpE were found to discriminate the 23 Salvia species. A total of 91 intergenic spacer sequences were identified through genetic distance analysis. The two specific IGS regions (trnG-GCC-trnM-CAU and ycf3-trnS-GGA) have the highest K2p value identified in the three studied Salvia species. Furthermore, the phylogenetic tree showed that the 23 Salvia species formed a monophyletic group. Two pairs of genus-specific DNA barcode primers were found. The results will provide a solid foundation to understand the phylogenetic classification of the three Salvia species. Moreover, the specific intergenic regions can provide the probability to discriminate the Salvia species between the phenotype and the distinction of gene fragments.


Introduction
The Lamiaceae family is the sixth-largest family of flowering plants. It includes 10 subfamilies, 220 genera, and 3500 species [1]. Most of the species are mainly distributed in Asia, Europe, and Africa. In historical evolution, the family of Lamiaceae is most closely related to the family of Verbenaceae and Violinaceae [2]. In China, 99 genera and more than 800 species in the Lamiaceae family are found, which include about 1050 Salvia species. Among them, 78 varied species and 32 variants mostly grow in tropical or temperate areas [3]. Regarding the classification development of the Salvia genus, Bentham [4] once divided it into four subgenera and 12 groups and Briquet [5] divided it into 8 subgenera and 17 groups. There are also different taxonomic studies about the Salvia genus in the different regions, for example, subgenus Calosphace was divided into 91 groups by the American scientist Carl Epling [6], and the genus increased to 102 within the subsequent 20 years. In Europe and Africa, the Salvia genus is divided into four subgenera and eight

Morphological Characteristics of the Three Salvia Species
The three Salvia species have the common specifications of the Lamiaceae family: quadrangular stem, opposite leaves, corolla flower lip, and four nutlets. However, they have the obvious distinction from the phenotype of flower colors (1) varying from pink and purple (S. bowleyana and S. officinalis) to red (S. splendens). Moreover, the three Salvia species are perennial herbs with oblong or oval leaves (2), cymose inflorescences (3), and nutlets. Nevertheless, for S. bowleyana, the leaves are glabrous on both sides, only the veins are slightly pilose, and the top of the fruit is hairy (Figure 1a). For S. splendens, the stems, leaves on both sides, and petioles are not glabrous with glandular spots below. The fruits have irregular folds at the top, and narrow wings at the edge (Figure 1b). For Salvia officinalis, the stems, many branches, leaf surfaces, and petioles are covered with white short villi. The fruits are smooth and hairless (4) (Figure 1c) [1].

of 27
3 chemicals. Therefore, we sequenced and analyzed the chloroplast genomes of three Salvia species for the first time to identify divergence hotspots of phylogenetic genome regions and detect the applicability of phylogenomics for further resolving the evolutionary and systematic relationship in the Salvia genus of the Lamiaceae family.

Morphological Characteristics of the Three Salvia Species
The three Salvia species have the common specifications of the Lamiaceae family: quadrangular stem, opposite leaves, corolla flower lip, and four nutlets. However, they have the obvious distinction from the phenotype of flower colors (1) varying from pink and purple (S. bowleyana and S. officinalis) to red (S. splendens). Moreover, the three Salvia species are perennial herbs with oblong or oval leaves (2), cymose inflorescences (3), and nutlets. Nevertheless, for S. bowleyana, the leaves are glabrous on both sides, only the veins are slightly pilose, and the top of the fruit is hairy (Figure 1a). For S. splendens, the stems, leaves on both sides, and petioles are not glabrous with glandular spots below. The fruits have irregular folds at the top, and narrow wings at the edge (Figure 1b). For Salvia officinalis, the stems, many branches, leaf surfaces, and petioles are covered with white short villi. The fruits are smooth and hairless (4) (Figure 1c) [1].

Gene Compositions Comparison of 23 Salvia Species
Schematic representations of S. bowleyana, S. splendens, and S. officinalis chloroplast genomes are shown in Figure 2   S. officinalis (c) chloroplast genomes. Each map contains seven circles. From the center going outward, the first circle shows the distributed repeats connected with red (the forward direction) and green (the reverse direction) arcs. The next circle shows the tandem repeats marked with short bars. The third circle shows the microsatellite sequences as short bars. The fourth circle shows the size of the LSC and SSC. The fifth circle shows the IRA and IRB. The sixth circle shows the GC contents along the plastome. The seventh circle shows the genes having different colors based on their functional groups.

Gene Loss Analysis of the Chloroplast Genomes from 41 Species in the Lamiaceae Family
The gene losses of chloroplast genomes were analyzed in the 41 species of the Lamiaceae family that originated from the phylogenetic tree (Table 3). These species originated from 8 genera (Salvia, Rosmarinus, Agastache, Dracocephalum, Ajuga, Leonurus, Elsholtzia, and Caryopteris) of the Lamiaceae family. In the dual IR regions of chloroplast genomes, one of the rpl20 genes was stable and found in all 41 species; however, another one was found only in D. heterophyllum. Therefore, the intact rpl20 gene often can be used as the molecular signature gene in the angiosperm [37]. In addition, one of the ycf 1 genes was across the SSC and IRb regions, the other pseudogene was across the SSC and IRa regions. Loss of the first ycf 1 gene was observed in five chloroplast genomes of A. campylanthoides, A. ciliata, A. decumbens, A. lupulina, and A. nipponensis. Loss of the second one was not found in the chloroplast genomes of twenty-eight species except in the twelve chloroplast genomes from the six Salvia genus (S. digitaloides, S. daiguii, S. meiliensis, S. chanryoenica, S. yangii, and S. nilotica), A. rugosa, the four Dracocephalum genus (D. heterophyllum, D. taliense, D. tanguticum, and D. moldavica), and L. japonicas. As reported, in a total of 420 species, 357 species could be distinguished using ycf 1 by means of specific primers designed for the amplification of these regions [38]. Moreover, the losses of the ycf 15 genes occurred in five chloroplast genomes (S. hispanica, S. tiliifolia, S. chanryoenica, A. forrestii, and E. densa). Although the gene function of ycf 15 genes is unknown, the transcriptome analyses of the Camellia genus revealed that the ycf 15 gene was transcribed as a precursor polycistronic transcript which contained ycf 2, ycf 15, and antisense trnL-CAA [39]. Furthermore, the six genes in the LSC region, e.g., petN, accD, rps2, rps16, rps18, and rps19 were absent in the chloroplast genomes of C. trichosphaera, R. officinalis, D. moldavica, E. densa, D. heterophyllum, and L. japonicus, respectively. In contrast, in the SSC region, loss of the rpl32 and ndhD genes was found only in S. splendens and C. mongholica chloroplast genomes, respectively. Surprisingly, loss of the rpl32 gene can be transferred to the nucleus from the chloroplast genome of Euphorbia schimperi and this can be verified through the method of being sequenced in the nuclear transcriptome of E. schimperi (Table 3) [40]. The type of gene loss was mostly affirmed to be consistent with the topology of the evolutionary tree.

Analysis of Simple Sequence Repeats Polymorphism in the 23 Salvia Chloroplast Genomes
Repeat sequences have been commonly used as genetic markers to understand the evolution of the genus in the same family. Scattered (interspersed) repetition and tandem repetition sequences consisting of simple sequence repeats (SSRs) were analyzed in the 23 Salvia chloroplast genomes (Table S2, Figure 3). We analyzed the content and percentage of SSR sequences in the 23 Salvia species. The results showed that 16, 12, and 10 SSR contained "A" as the repeat unit and 18, 14, and 14 SSR contained "T" as the repeat unit among the total 34, 26, and 24 mononucleotide repeats (Table S2) in the chloroplast genomes of S. bowleyana, S. splendens, and S. officinalis, respectively. Moreover, the mononucleotide numbers of "A" and "T" as the repeat unit have an obvious difference. From the statistical results, the number of Poly A and Poly T repeats varied from 6 (S. yangii) to 16 (S. bowleyana and S. miltiorrhiza f. alba), from 9 (S. plebeia) to 21 (S. prattii). Rare numbers of Poly C and Poly G repeats were found only in the chloroplast genomes of S. hispanica, S. plebeia, and S. meiliensis [41]. One SSR with "AT" as the repeat unit was found in the eight Salvia chloroplast genomes of S. splendens, S. digitaloides, S. daiguii, S. hispanica, S. tiliifolia, S. chanryoenica, S. prattii, and S. roborowskii. Di-nucleotide SSR contained "TA" as the repeat unit in twelve chloroplast genomes of S. bowleyana, S. bulleyana, S. przewalskii, S. yunnanensis, S. miltiorrhiza f. alba, S. chanryoenica, S. prattii, S. roborowskii, S. splendens, S. daiguii, S. hispanica, and S. tiliifolia, respectively. Nevertheless, one trinucleotide SSR with "AAT" as the repeat unit was found in the chloroplast genome of S. yunnanensis (Table S2). The mononucleotide repeat unit is the most abundant type of the SSR repeats and it accounted for the proportion from 88% to 100% through comprehensive statistics of chloroplast genomes in the 23 Salvia species.
The +/-refers to the presence/absence of a gene in each species that does not have the gene. "+": presence; "-" absence; * rps19 is across the area of LSC and IRb (add family and order information).  repeats (c). In (a), the different types of SSRs are filled in blue (mono A), orange (mono C), purple (mono T), red (mono G), green (di AT), blue (di TA), and gray (Tri AAT) together marked with the detailed quantum in yellow and black within the diverse columns. In (b), the percentage of mononucleotides, dinucleotides, and trinucleotides is filled in blue, purple, and green together marked with the detailed quantum in yellow and black within the diverse columns. In (c), the number of repeats in the types of forward repeats (F), reverse repeats (R), palindromic repeats (P), and complement repeats (C) is filled in blue, green, purple, and orange together marked with the detailed quantum in black above the diverse columns. Mono: mononucleotide; Di: dinucleotide; Tri: trinucleotide; F: forward repeats, R: reverse repeats; P: palindromic repeats, and C: complement repeats.

Repeat Sequences Analysis in the Chloroplast Genomes of 23 salvia Species
Except for in the SSR analysis of the 23 Salvia chloroplast genome, 29 tandem repeats by each species were identified for all the four kinds of tandem repeats, including the forward repeats, reverse repeats, palindromic repeats, and complement repeats in the chloroplast genomes of S. bowleyana (11 forward repeats, 3 reverse repeats, and 15 palindromic repeats), S. splendens (11 forward repeats, 4 reverse repeats, and 14 palindromic repeats) and S. officinalis (10 forward repeats, 5 reverse repeats, and 14 palindromic repeats), respectively. The greatest numbers of repeat types were forward repeats and palindromic repeats, while the numbers of reverse repeats and complement repeats were less and the latter were found only in the six chloroplast genomes, including S. przewalskii, S. daiguii, S. meiliensis, S. merjamie, S. yangii, and S. nilotica. The comparison of the number of predicted tandem repeats is shown in Tables S3-S6, and Figure 3c.

Structures of the IR Boundaries and Gene Features from 23 Salvia Species
The IR boundaries structure was analyzed in the 23 Salvia chloroplast genomes of the Lamiaceae family. From the analysis, six distinct genes, rpl22, rps19, rpl2 (×2), ycf 1, ndhF, and psbA, were most explicitly found in the diverse regions or at the border regions of 23 chloroplast genomes ( Figure 4). Furthermore, the variation range of these gene lengths was similar and did not exceed 2%. The genes of rpl22 and psbA were located in the LSC region, whereas rpl2 genes were located in the two IR regions in these species. One of the rps19 genes was located at the border area of LSC and IRb in all species. In addition, small fragments of the rps19 genes (rps19 pseudogene) were found at the border regions of the

The Discrepancy of the 23 Salvia Chloroplast Genomes
The structures of chloroplast genomes are highly conserved. The medicinal plants can be accurately identified and distinguished by the comparison of barcodes from the whole chloroplast genome. The sequences of chloroplast genomes in the 23 Salvia species were analyzed using mVISTA, and the alignments were visualized with the Salvia bowleyana chloroplast genome as the reference genome ( Figure S3). We found the sequences of 23 Salvia chloroplast genomes were mostly identically conserved except for the three variable areas located in the intergenic regions of the LSC region. The first one was the IGS region (rps16-trnQ-UUG) found in the nine Salvia chloroplast genomes (S. officinalis, S. japonica, S. sclarea, S. meiliensis, S. hispanica, S. tiliifolia, S. yangii, S. splendens, S. nilotica) ( Figure S3 (A)). The second one was the IGS region (trnL-UAA-trnF-GAA) varied in the chloroplast genome of S. chanryoenica ( Figure S3 (B)). The last one was the IGS region (trnM (cau)-atpE) diversified in the three chloroplast genomes of S. chanryoenica, S. hispanica, and S. japonica ( Figure S3 (C)).
Interestingly, the two IGS regions of trnG-GCC-trnM-CAU ( Figure S4a, M1) and ycf 3-trnS-GGA ( Figure S4b, M2) were specific in the three studied species. We cloned the two regions and acquired the sequences of M1 (~300 bp) and M2 (~800bp) using Sanger sequencing (Table S12). Then, we comparatively analyzed the two molecular markers (MMs) among the three studied Salvia species to determine the variations, including indels and single nucleotide polymorphisms (SNP) ( Table 4, M1 and M2). The amplification products of the two IGS were checked and the strips were clearly shown on the agarose gel ( Figure S4). From the peak map (up) and sequencing results (down) of the three studied Salvia species with the pairs of primers from M1 and M2 (Figure 6), four variant loci of SNP or indels were found among them and marked A, B, C, and D, respectively, at Figure 6. Therefore, the three Salvia species can be successfully discriminated based on these SNP and indel loci by separately or unitedly using the two M1 and M2 molecular markers. The intergenic region s SNP (iSNP) has the potential to directly affect the protein structures or expression levels in accordance with the particular localization; therefore, it may affect the plant traits or genetic mechanisms [43]. In contrast to markers of the Salvia genus, two markers derived from the IGS regions of petN-psbM and psaJ-rpl33 can be successfully used to distinguish the five Alpinia species [44]. genus, two markers derived from the IGS regions of petN-psbM and psaJ-rpl33 can be successfully used to distinguish the five Alpinia species [44].

Identification and Comparison of the Genus-Specific DNA Barcodes Primer and Sequences
Primers can be designed from highly variable intergenic spacer sequences for PCR amplification. Then, we can distinguish the 23 Salvia species in the Lamiaceae family by sequence alignment and analysis using ecoPrimers software. After comparison, the two conservative intervals can be amplified through the designed PCR amplification primers to distinguish the 23 salvia genus. The primer sequences are shown in Table 4 (M3 and  M4). Surprisingly, the two pairs of primers can be used to amplify the sequences of trnM-CAU-atpE and ccsA-ndhD after comparison between the Salvia chloroplast genomes and the BlastN database. Furthermore, the alignment results based on the blast database indicate that the two pair primers can also especially suit other distinct species, e.g., Scutellaria genus (Lamiaceae), Camellia genus (Theaceae), Styrax genus (Styracaceae), Melissa genus (Lamiaceae), Eucalyptus genus (Myrtaceae), etc.

Phylogenetic Analysis
The sequences of chloroplast genomes are a valuable database for the research of the evolutionary relationship in plants. To determine the phylogenetic positions of the three Salvia species in the Lamiaceae family, 80 protein sequences were extracted using the Phy-loSuite software from the 43 chloroplast genomes in the species (Table S1). Among them, 25 shared CDS proteins sequences were found present in 43 species, including rpl14, rpl33,

Identification and Comparison of the Genus-Specific DNA Barcodes Primer and Sequences
Primers can be designed from highly variable intergenic spacer sequences for PCR amplification. Then, we can distinguish the 23 Salvia species in the Lamiaceae family by sequence alignment and analysis using ecoPrimers software. After comparison, the two conservative intervals can be amplified through the designed PCR amplification primers to distinguish the 23 salvia genus. The primer sequences are shown in Table 4 (M3 and M4). Surprisingly, the two pairs of primers can be used to amplify the sequences of trnM-CAU-atpE and ccsA-ndhD after comparison between the Salvia chloroplast genomes and the BlastN database. Furthermore, the alignment results based on the blast database indicate that the two pair primers can also especially suit other distinct species, e.g., Scutellaria genus (Lamiaceae), Camellia genus (Theaceae), Styrax genus (Styracaceae), Melissa genus (Lamiaceae), Eucalyptus genus (Myrtaceae), etc.

Phylogenetic Analysis
The sequences of chloroplast genomes are a valuable database for the research of the evolutionary relationship in plants. To determine the phylogenetic positions of the three Salvia species in the Lamiaceae family, 80 protein sequences were extracted using the PhyloSuite software from the 43 chloroplast genomes in the species (Table S1). Among them, 25 shared CDS proteins sequences were found present in 43 species, including rpl14, rpl33, rpl36, rps7, rps14, psbB, psbC, psbD, psbE, psbF, psbN, psaB, psaC, psaI, petA, petG, petL, ndhC, ndhG, cemA, atpA, atpB, atpH, atpI, and ycf 4 genes. We identified 29 proteins shared by 37 Lamiaceae species. However, there were only 25 proteins commonly shared in the studied 43 species. The other four proteins, including atpE, psbA, psbJ, and psbM, were only shared in 37 Lamiaceae species. The multiple sequence alignments of the 29 proteins are shown in Figure S5. Using L. chuanxiong (Apiaceae family) and P. notoginseng (Araliaceae family) as the outgroups, the phylogenetic tree was generated by three methods of maximum likelihood (ML), maximum parsimony (MP), and neighbor-joining (NJ) based on the above-described data of whole chloroplast genomes. The three phylogenetic trees showed the same evolutionary relationship, in which 41 species including 37 species of the Lamiaceae family and four species of the Verbenaceae family were clustered together with 6 obvious clades. Among them, five species including Dracocephalum species (D. heterophyllum, D. Taliense, D. tanguticum, and D. moldavica) and A. rugose were clustered into one branch; in contrast, 23 Salvia species and one Rosmarinus species (R. officinalis) were clustered into one branch with six subbranches (Figure 7). In addition, six species from the Ajuga genus and four species from the Caryopteris genus were clustered into the other two branches, respectively. Single species of L. japonicus and Elsholtzia densa were gathered into one branch, partly, whereas the species of outgroups were more distantly related to other species. The ML bootstrap showed strong support with bootstrap values of 100% for eight nodes. The phylogenetic results resolved 26 nodes with bootstrap support values of 54-100 and that of 17 nodes were ≥ 74% (Figure 7).

The Characteristics of Chloroplast Genomes and Genes in the Salvia Genus
In the chloroplast genomes of S. bowleyana, S. splendens, and S. officinalis, the total numbers of protein-coding genes were identical except that of S. Splendens was one less. The total numbers of tRNA and rRNA genes were the same as those of other Salvia species. These results indicated that the chloroplast genomes of the Salvia species were highly conserved. The selected 41 species from the Lamiaceae family and the two outgroup species (L. chuanxiong and P. notoginseng) possessed similar pharmacological effects, such as promoting blood circulation for removing blood stasis, increasing coronary flow, improving microcirculation, protecting the heart, improving the body hypoxia resistance, and having anti-hepatitis, antitumor, and antiviral effects [45]. Chloroplasts play an irreplaceable role in the formation of chemicals and the development of phenotypes due to the genes from nuclear, and mitochondrial genomes. However, the variability of the nuclear genome was found to be higher than that of the chloroplast genome and mitochondrial genome, as reported from the average genetic distance among all the strains of CWR and cultivated rice [46]. Therefore, it is indispensable to analyze the genetic divergence in the chloroplast genomes of Salvia species.

The Divergence between IGS Regions of the Salvia Genus Compared to Other Plants
It makes sense that the DNA sequences of the hypervariable regions and comparison of chloroplast genomes in three IGS regions of rps16-trnQ-UUG, trnL-UAA-trnF-GAA, and trnM(cau)-atpE can be used to distinguish the ten Salvia species (S. officinalis, S. japonica, S. sclarea, S. meiliensis, S. hispanica, S. tiliifolia, S. yangii, S. splendens, S. nilotica, and S. chanryoenica). The first IGS region has been found in the species of Zingiber officinale and Cofeeae alliance [47,48]. The second one commonly occurs in the angiosperm [49]. The last one has diversified and some parts of the oldest mtDNAs of trnV(uac)-trnM(cau)-atpE-atpB-rbcL were transferred from cpDNA to mtDNA since they have a common ancestor in extant gymnosperms and angiosperms [50]. As reported, the phylogenetic relationships in the Eurystachys clade were reconstructed utilizing nuclear ribosomal DNA sequences (nrETS, 5S-NTS) from 148 accessions into 12 well-supported genera, including widely recognized and well-defined segregates such as Prasium and Sideritis [51].
In contrast, the special IGS regions of the two iSNPs, namely trnG-GCC-trnM-CAU and ycf 3-trnS-GGA, were used to discriminate the three studied Salvia species. In previous studies, most of the SNPs were found in intergenic sequences, and the trnG-GCC-trnM-CAU was one of the maximum number of SNPs found four times to distinguish the six Saccharum species [52]. Meantime, the variable hotspot regions of ycf 3-trnS-GGA also can be useful as the candidate DNA barcodes for Adoxaceae and Caprifoliaceae species, and also for assessing interspecific divergence in Dipsacales species [53]. In addition, research has shown that the rps14 gene can be used as a DNA barcode for the identification of 34 Lamiaceae species collected from plants in the Pakistan area [54].
Therefore, the DNA barcode primers identified in the study can be potentially developed for the identification and phytotaxonomy of genus Salvia species through the divergence IGS regions.

The Functional Features of IR Regions and Genes of the Salvia Genus together with Other Plants
The sequences of IR can complement a certain segment of the upstream sequence downstream of the same DNA strand. They can then form a hairpin structure with a double helix stem and a single-stranded ring with a DNA double helix. The sequence between two reverse repeat units forms a single chain loop. Two copies are separated by a sequence or no interval sequence, which is in reverse series, and will form a specific palindrome sequence (P) [55]. Compared to the IRLC between the Papilionoideae subfamily [56] and the Lamiaceae family, they have the four common genes of ndhB, rpl23, ycf 1, and ycf 15.
In the IR regions, the genes of ndhB, rpl2, rpl23, rps7, rps12, and ycf 2 were present in the chloroplast genomes of 41 species, and these genes have a special function in the area of gene expressions. There are the five hypothetical coding regions genes of ycf 1, ycf 2, ycf 4, ycf 15, and two open reading frames (ORF42 and ORF56), which are also found in the chloroplast genomes of the other species, such as Clerodendranthus spicatus [57]. Both genes ycf 3 and ycf 4 were present in the LSC region of the 41 species chloroplast genomes. The sequence of ycf 3 is conserved in plants and contains three tetratrico-peptide repeats (TPR), which can act as the functions essential for the accumulation of the photosystem I (PSI) complex through a post-translational level [58,59]. The ycf 4 gene forms modules that mediate PSI assembly and facilitate the integration of peripheral PSI subunits and LHCIs into the PSI reaction center subcomplex [60].

DNA Extraction, Determination of DNA Quality, and PCR Amplification Products
Total genomic DNA was extracted from the 20 dried leaves for sequencing of chloroplast genome and fresh leaves were taken from the three single plants for cloning the DNA barcode sequences using a plant genomic DNA kit (Tiangen Biotech, Beijing, China). The extraction of DNA is a universal technology as the flowchart shows in Dr. Li s research [61]. We firstly ground plant tissues with liquid nitrogen. Then, we added the GPS buffer, RNase A, GPA buffer, absolute ethyl alcohol, deprotein fluid RD, bleach solution PW, and elution buffer TB. Next, we loaded the collection solutions on a column. During the courses, we mixed the solution and centrifuged each step. Lastly, the DNA bound to the column was eluted with an elution buffer. The DNA purity and amplification products were detected by 1.0% agarose gel electrophoresis stained with ethidium bromide alongside a 100 bp ladder (New England Biolabs, Ipswitch, MA, USA) using the DNA marker as the reference to determine the size of the amplified fragments (Takara) [62]. Otherwise, DNA concentration was determined using the Nanodrop spectrophotometer 2000 (Thermo, Waltham, Massachusetts, USA). Furthermore, the extraction of chloroplast DNA (cpDNA) for whole plastid genome sequencing should undergo three stages: separation of chloroplasts from cells, purification of chloroplasts, and isolation of cpDNA [63].

Chloroplast Genome Sequencing, Assembly, Annotation, and Manual Curation
DNA extracts containing the DNA concentration of 500 ng were applied to construct a library with lengths of short-insert fragments of 500 bps. The library was sequenced in a pair-end model with a read length of 150 bp on an Illumina Hiseq 2500 platform in accordance with the MiSeq platform provided by the manufacturer s directions [64]. The sequencing raw data were acquired from S. bowleyana, S. splendens, and S. officinalis with sizes of 7.1 Gbs, 6.8 Gbs, and 7.02 Gbs and 250bps pair-end read lengths, respectively. The raw data were submitted to the NCBI database and assigned the Sequence Read Archive (SRA) accession numbers SRR14415377, SRR17843445, and SRR17853381, respectively. The raw reads were filtered using Trimmomatic 0.35 with default parameters to remove adapters and low-quality bases [65]. The three chloroplast genomes were assembled using the NOVOPlasty (v 4.2) software [66] with the default parameters and the rbcL sequences as the seed. After that, we annotated these genomes using the CpGAVAS2 web service (http://www.herbalgenomics.org/cpgavas2/, accessed on 1 May 2022) [67]. The annotation errors were manually corrected using the Apollo software [68]. The assembly and the annotation results of S. bowleyana, S. splendens, and S. officinalis were submitted to GenBank with the accession numbers OM617845, OM617847, and OM617846, respectively.

Visualization and Analysis of Genome Content, cis-and Trans-Splicing genes
The chloroplast genome structure, cis-splicing genes, and trans-splicing PCGs were visualized using CPGview-RSG software (http://www.1kmpg.cn/cpgview/, accessed on 1 May 2022) [69]. The gene contents of 41 studied species (Table S1) were analyzed including the length of the complete genome sequences and the four regions, all genes, CDS, tRNAs, and rRNAs.

Repeat Analysis
We annotated the repeat sequences using the CPGAVAS2 for the chloroplast genomes of S. bowleyana, S. splendens, and S. officinalis. The SSRs of 23 Salvia species were identified using MISA software (http://pgrc.ipk-gatersleben.de/misa/, accessed on 1 May 2022) [70], also called the microsatellite sequence. The minimum numbers of repeat units for mononucleotide, dinucleotide, trinucleotide, tetranucleotide, pentanucleotide, hexanucleotide, and hexagenucleotide were set as 10, 5, 4, 3, 3, 3 and 3, respectively. The minimum distance between the 2 SSRs was set to 100 bp. If the distance was less than 100 bp, the two SSRs were treated as a composite microsatellite. The tandem repeats sequence (TRS) of the 23 Salvia chloroplast genomes was predicted using the Tandem Repeats Finder (TRF) software [71]. The interspersed repeats sequence (IRS) was predicted using the REPuter program (https://bibiserv.cebitec.uni-bielefeld.de/reputer, accessed on 1 May 2022), with the parameters as follows: maximum computed repeats = 30 and minimal repeat size = 8) [72]. The comparison of the chloroplast genomes was conducted using VMATCH software (Professor Stefan Kurtz, Computer Science at the Center for Bioinformatics, University of Hamburg, Germany) [73].

Comparative Genomic Analysis
We downloaded 40 chloroplast genomes sequences from the GenBank database including 38 species from the Lamiaceae family and two outgroups (Ligusticum chuanxiong from the Apiaceae family and Panax notoginseng from the Araliaceae family, for further analysis. The boundaries of the LSC, SSC, and IR regions boundary of chloroplast genomes from 23 Salvia species were visualized using the IR scope software (https://irscope.shinyapps.io/irapp/) [74] and the characteristic genes including the diverse areas were analyzed. The chloroplast genome sequences of 23 species from Salvia genera were compared with the annotated S. bowleyana chloroplast as the reference using the mVISTA program in a Shuffle-LAGAN mode with default parameters (Rank VISTA probability threshold = 0.5) [75,76]. The genetic distances of IGS regions from the chloroplast genomes of 23 Salvia species were calculated using the distmat program from EMBOSS (v6.3.1) [77] with the Kimura 2-parameters (K2p) evolutionary model [78].

Primer Identification and Design, PCR Amplification, Sequencing, and Analysis of Genus-Specific DNA Barcode Sequences
To discover DNA barcode sequences that can distinguish the 23 Salvia species, especially the three studied species, we analyzed the PCR amplification primers from their chloroplast genome sequences using ecoPrimers software [79]. Moreover, the sequences of two pairs of primers were compared to the other species through the CBI Multiple Sequence Alignment Viewer (Version 1.21.0, Max Seq Difference = 0.75) from the BLASTN website (https://blast.ncbi.nlm.nih.gov/) [80]. The two pairs of specific primers were designed to differently amplify the specific IGS regions identified in the three studied Salvia species by the Primer 3 software [81]. The PCR amplification system for genus-specific DNA barcode sequences of each reaction included 12.5 µL of 2 Taq PCR Master Mix (TransGen Biotech), 1.0 µL of each primer (0.4 µM), 2.0 µL of extracted template DNA, and ddH 2 O added to a final volume of 25 µL [82]. A negative control (Milli-Q water in place of DNA template) was included in each PCR to ensure there was no contamination. All the amplifications were performed on a Pro-Flex PCR system (Applied Biosystems, Waltham, MA, USA) instrument with the amplification procedures: degeneration 94 • C for 2 min followed by 35 cycles of 94 • C for 30 s, 57 • C for 30 s, 72 • C for 60 s, and a final extension step at 72 • C for 2 min. The amplification products were saved at 4 • C and sequenced at SinoGenoMax Co., Ltd. using the Sanger sequencing platform with the same cloning primers on the ABI Prism 3730 Genetic Analyzer (Applied Biosystems, USA). The sequences were spliced and analyzed by the GeneDoc software (3.2) [83].

Conclusions
The complete chloroplast genomes of S. bowleyana, S. splendens, and S. officinalis were acquired using Illumina sequencing technology. These three species can be easily discriminated from the phenotype. Phylogenetic analysis showed that 23 Salvia species and one Rosmarinus genus were clustered into one branch with six subbranches, of which the three studied species were included in the diverse branches. The sequence divergence found seven sites of IGS regions: rps16-trnQ-UUG, trnL-UAA-trnF-GAA, trnM-CAU-atpE, trnL-UAG-ccsA, ccsA-ndhD, rps15-ycf 1, and ndhE-ndhG. Notably, the two IGS regions of trnG-GCC-trnM-CAU and ycf 3-trnS-GGA were identified in the three studied Salvia species. The sequences divergence had a high variability and indicates they can be developed as DNA markers for further identification and phytotaxonomy of the Salvia genus. Overall, the data obtained will contribute to further development of the authentication, diversity, ecology, taxonomy, phylogenetic evolution, and conservation of the Salvia genus in China.