Seven Complete Chloroplast Genomes from Symplocos: Genome Organization and Comparative Analysis

In the present study, chloroplast genome sequences of four species of Symplocos (S. chinensis for. pilosa, S. prunifolia, S. coreana, and S. tanakana) from South Korea were obtained by Ion Torrent sequencing and compared with the sequences of three previously reported Symplocos chloroplast genomes from different species. The length of the Symplocos chloroplast genome ranged from 156,961 to 157,365 bp. Overall, 132 genes including 87 functional genes, 37 tRNA genes, and eight rRNA genes were identified in all Symplocos chloroplast genomes. The gene order and contents were highly similar across the seven species. The coding regions were more conserved than the non-coding regions, and the large single-copy and small single-copy regions were less conserved than the inverted repeat regions. We identified five new hotspot regions (rbcL, ycf4, psaJ, rpl22, and ycf1) that can be used as barcodes or species-specific Symplocos molecular markers. These four novel chloroplast genomes provide basic information on the plastid genome of Symplocos and enable better taxonomic characterization of this genus.


Introduction
Chloroplasts (CPs) are characteristic plant organelles that play an important role in photosynthesis. The CP genome is markedly similar across most land plant lineages in terms of gene order, gene content, structure, and intron content [1]. The CP genome can harbor as many as 101-118 genes including 66-82 protein-coding genes, 29-32 tRNA genes, and four rRNA genes [2]. CPs contain independently replicated genomes, most of which exhibit a four-segment molecular structure with a large single-copy (LSC, 80-90 kb in length) and a small single-copy (SSC, 16-27 kb in length) region separated by a pair of inverted repeats (IRa and IRb, 20-28 kb in length) [1,3]. However, this typical structure is altered in some plant lineages. For instance, in Cupressaceae [4] and Taxaceae [5], one IR has been lost. In Pinaceae, the IR length is reduced to below 1 kb [6,7]. In contrast, in Ericaceae, IR region expansion resulted in a significant decrease in the SSC region size [8,9]. Additionally, events such as rearrangement, gene loss, gene replication, pseudogene generation, and intron gain/loss have occurred in the CP genomes of various plant lineages [10,11]. CPs are frequently used in taxonomic and evolutionary studies [12] as they are uniparentally inherited (mostly maternally transmitted, but paternally transmitted in conifers), have well-preserved gene arrangement and content, and small size [13].
The genus Symplocos Jacquin consists of woody flowering plants found mainly in humid tropical forests, with approximately 300 species distributed in the New World and the Western Pacific Rim [14]. Symplocos was originally recognized as the sole genus of Symplocaceae Jacquin [7,15], but the Angiosperm Phylogeny Group [16,17] now recognizes two genera in Symplocaceae (Cordyloblaste Moritzi and Symplocos). Although several molecular studies of Symplocos have supported its monophyly, they used only a few protein-coding gene sequences (rpl16, matK) and partial non-coding sequences (nr-ITS, trnL-trnF, trnC-trnD, and trnH-psbA), and no comparative genomic analyses of Symplocos species have been conducted to date [14,18,19].
Four Symplocos species are endemic to South Korea [20][21][22]. Of these, S. prunifolia Siebold & Zucc. and S. coreana (H. Lév.) Ohwi grow only on Jeju Island in South Korea [23]. S. prunifolia is classified as an endangered, rare plant [24]. Thus, comparing the CP genomes of these species is essential for discriminating them at the molecular level and for supporting their ongoing conservation.
To date, the CP genome sequences of only three Symplocos species (S. paniculata [Thunb.] Miq., S. ovatilobata Noot., and S. costaricana Hemsl.) have been deposited in the National Center for Biotechnology Information database (NCBI), and no complete CP genome sequence of the Korean Symplocos species has been reported. In the present study, we aimed to sequence the CP genomes of four species of South Korean Symplocos. These CP genomes will provide the basis for studying the evolutionary history of Symplocos species and enable accurate taxonomic identification of vulnerable species.

Sample Collection, DNA Extraction, and CP Genome Sequencing
Fresh leaf samples were obtained from four Symplocos species growing on Jeju Island in South Korea, and total genomic DNA was extracted using a Plant SV Mini Kit (GeneAll Biotechnology, Seoul, Korea), according to the manufacturer's instructions. Intact leaf specimens were deposited into the herbarium at the Warm Temperate and Subtropical Forest Research Center (WTFRC; Table 1). The extracted DNA was quantified using a spectrophotometer (ND-1000, Nano-Drop Technologies, Wilmington, DE, USA). Genomic DNA libraries were produced, amplified, and sequenced using an Ion Xpress™ Plus Fragment Library Kit (Thermo Fisher Scientific, Waltham, MA, USA), Ion PI™ Hi-Q™ Sequencing 200 Kit (Thermo Fisher Scientific), and Ion PI™ Chip v3 Kit (Thermo Fisher Scientific).

CP Genome Assembly and Annotation
CP DNA data were filtered using SPAdes [25]. Four CP genomes were assembled using Geneious 10.2.6 [26] and annotated using DOGMA [27], followed by manual editing of non-annotated portions such as exons and introns. The tRNA sequences were confirmed using tRNAscan-SE 1.21 [28]. All annotations were checked against the reference genomes (MG719832, MF770705, and MF179496). Genome maps were drawn using OrganellarGenomeDRAW (OGDRAW) [29].

Genome Comparison
The CP genomes were aligned using MAFFT [30]. The complete CP genomes of the seven Symplocos species were compared using m-VISTA [31]. Additionally, the CP genome junctions were visualized and compared using IRscope [32].

Simple-Sequence Repeat (SSR) and Long Repeat Sequence Analysis
SSRs within the seven CP genomes were detected using the MISA Perl script (MIcroSAtellite) [33]. The minimum number of repeats was set to 10 for mononucleotide repeats, 5 for dinucleotide repeats, 4 for trinucleotide repeats, and 3 for tetra-, penta-, and hexanucleotide repeats. REPuter was used to identify forward, reverse, complementary, and palindromic sequences with a minimum repeat size of 30 bp and the sequence identity set to 90% [34].
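The repeat-count thresholds above can be illustrated with a small regular-expression scan. This is a simplified sketch and not MISA itself: the function names `find_ssrs` and `is_primitive` and the toy sequence are hypothetical, only perfect repeats are reported, and compound or overlapping SSRs are not handled.

```python
import re

# Minimum number of repeat units per motif length (1-6 nt), as stated in the text.
MIN_REPEATS = {1: 10, 2: 5, 3: 4, 4: 3, 5: 3, 6: 3}


def is_primitive(motif):
    """True if the motif is not itself a repetition of a shorter unit (e.g. 'ATAT')."""
    k = len(motif)
    return not any(k % d == 0 and motif == motif[:d] * (k // d) for d in range(1, k))


def find_ssrs(seq):
    """Return sorted (start, motif, repeat_count) tuples; start is 0-based."""
    seq = seq.upper()
    hits = []
    for k, min_rep in MIN_REPEATS.items():
        # ([ACGT]{k}) captures one motif; \1{n,} requires at least n further copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_rep - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            if is_primitive(motif):  # skip e.g. 'AA' x6, already counted as 'A' x12
                hits.append((m.start(), motif, len(m.group(0)) // k))
    return sorted(hits)


# Toy sequence (illustrative only): A x12, AT x6, and AGC x4 separated by spacers.
toy = "GC" + "A" * 12 + "CGG" + "AT" * 6 + "CCT" + "AGC" * 4 + "TT"
ssrs = find_ssrs(toy)
```

On a real genome, the sequence would be read from a FASTA file and the hits tallied by motif length to obtain counts comparable to those reported per species.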

Divergent Hotspot Identification
The seven Symplocos CP genomes were aligned using MAFFT and Geneious 10.2.6. Nucleotide diversity was analyzed using DnaSP version 6.12.03 [35], with the window length set to 800 bp and the step size set to 200 bp.
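The sliding-window calculation can be sketched as follows. This is a minimal illustration, not the DnaSP implementation: Pi is taken here as the average number of pairwise differences per site, alignment columns containing gaps or ambiguous bases are excluded (an assumption), and the function names `window_pi` and `sliding_pi` are hypothetical.

```python
from itertools import combinations


def window_pi(seqs, start, end):
    """Nucleotide diversity (Pi) for alignment columns start..end (end-exclusive)."""
    # Assumption: keep only columns in which every sequence has an unambiguous base.
    cols = [col for col in zip(*(s[start:end] for s in seqs))
            if all(base in "ACGT" for base in col)]
    if not cols:
        return 0.0
    n_pairs = len(seqs) * (len(seqs) - 1) // 2
    # Sum pairwise mismatches over all columns, then average per pair and per site.
    diffs = sum(sum(a != b for a, b in combinations(col, 2)) for col in cols)
    return diffs / (n_pairs * len(cols))


def sliding_pi(seqs, window=800, step=200):
    """Return (start, end, Pi) for each window along an aligned set of sequences."""
    length = len(seqs[0])
    results = []
    for start in range(0, max(length - window, 0) + 1, step):
        end = min(start + window, length)
        results.append((start, end, window_pi(seqs, start, end)))
    return results


# Toy alignment of three sequences (illustrative only).
toy_alignment = ["AAAA", "AAAT", "AATT"]
toy_windows = sliding_pi(toy_alignment, window=2, step=2)
```

With the default arguments the function uses the 800 bp window and 200 bp step stated above; the toy call uses a 2 bp window only so that the result can be checked by hand.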

Comparison of CP Genomes of Seven Symplocos Species
We used m-VISTA to compare the gene sequences and content of the seven Symplocos CP genomes. The analyzed genomes were almost identical (Figure 2), with the coding regions more conserved than the non-coding regions, and the IR regions more conserved than the LSC and SSC regions. The boundary structure of the seven Symplocos CP genomes was also compared. The overall identity of the CP genomes was confirmed at the JLB (LSC/IRb) and JLA (IRa/LSC) junctions. trnH-GUG was located in the LSC region, 11 bp away from the JLA junction, and at the JLB junction 15 bp of the rps19 gene extended into the IR region in all seven species. In contrast, the length of ψycf1 in the IRa region ranged from 982 bp to 1053 bp. The CP genome of S. coreana includes 1 bp of the ψycf1 pseudogene at the JSB junction, whereas in the other species 2 bp of the ndhF gene extended into IRb (Figure 3).

SSR and Long Repeat Analysis
SSRs (microsatellites) are tandem repeats of 1-6 nucleotide motifs. The distribution of SSRs was analyzed in seven Symplocos CP genomes using MISA. We identified 44-59 repeats across the genomes. Mononucleotide repeats were the most abundant SSR in all species. Dinucleotide repeats were the least abundant, with three dinucleotide repeats in S. coreana only. S. coreana and S. costaricana harbored two trinucleotide repeats, and the rest of the species harbored only one. The number of tetranucleotide repeats was the lowest in S. coreana. Three pentanucleotide repeats were identified in S. costaricana and S. ovatilobata, and two in S. coreana. Hexanucleotide repeats were not identified in any of the species (Figure 4). Most SSRs consisted of the A/T motif rather than the G/C motif (Table 3 and Supplementary Data File S1).

The long repeat analysis identified more forward and palindromic repeats than reverse and complementary repeats in the seven Symplocos species. There were 29 long repeats across the seven species. Only one reverse repeat was found in S. coreana and S. ovatilobata (Figure 5a). The length of most repeats ranged from 20 to 29 bp, whereas the largest repeat was 46 bp long (S. coreana, S. prunifolia, and S. tanakana, Figure 5b). The location and number of iterations of the long repeats are shown in Table 4 and Supplementary Data File S2. SSR analysis of these species has helped to identify potential molecular markers for species-level identification of Symplocos.
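The four repeat orientations reported here (forward, reverse, complementary, and palindromic) can be illustrated with a short classifier. This is a sketch only, not the REPuter algorithm: it compares two equal-length copies by exact matching, whereas the analysis allowed 90% identity, and the function name `classify_repeat` is hypothetical.

```python
# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def classify_repeat(first, second):
    """Classify how the second copy relates to the first copy of a repeat.

    Returns 'forward', 'reverse', 'complementary', 'palindromic',
    or None if the two copies do not match exactly in any orientation.
    """
    first, second = first.upper(), second.upper()
    if second == first:
        return "forward"            # same sequence, same direction
    if second == first[::-1]:
        return "reverse"            # same sequence read backwards
    comp = first.translate(COMPLEMENT)
    if second == comp:
        return "complementary"      # complement, same direction
    if second == comp[::-1]:
        return "palindromic"        # reverse complement
    return None


# Toy example (illustrative only): the reverse complement of AAGC is GCTT.
kind = classify_repeat("AAGC", "GCTT")
```

The checks are applied in a fixed order, so a copy that satisfies more than one definition (for example, a sequence equal to its own reverse) is assigned the first matching label.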

Divergent Hotspots in the Symplocos CP Genome
Mutations that affect only a single nucleotide are called single nucleotide polymorphisms (SNPs). Overall, 1580 SNPs were identified in the CP genomes of the seven Symplocos species (Figure 6 and Supplementary Data File S3). Of these, 634 (40.1%) were located in the coding regions and 946 (59.9%) in the intergenic spacer (IGS) regions and introns. The level of sequence divergence was determined by calculating nucleotide diversity (Pi) for all CP genomes. Pi in the coding sequences (CDSs) ranged from 0.00049 (clpP) to 0.00974 (rbcL), with an average value of 0.0038. Pi in the IGSs ranged from 0.00066 (trnN-GUU~ndhF) to 0.0197 (rpl36~infA), with an average value of 0.0077. SNPs were also identified in three tRNA genes and in the 23S rRNA gene. Figure 7 shows the minimum, maximum, and average Pi values for five classes of genomic regions: CDSs, tRNAs, rRNAs, IGSs, and introns. The average divergence of the IGSs was approximately twice that of the next most divergent class (CDSs). rRNAs showed the lowest sequence divergence, with an average Pi of 0.00027 (23S rRNA).
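The split of SNPs between coding and non-coding regions can be sketched as a count of variable alignment columns. This is an illustrative sketch, not the procedure used in the study: the function names `count_snps` and `percent`, the toy alignment, and the CDS interval are hypothetical, and gaps are ignored when deciding whether a column is variable (an assumption).

```python
def count_snps(seqs, cds_intervals):
    """Count variable columns inside and outside CDS intervals.

    cds_intervals holds (start, end) pairs, 0-based and end-exclusive.
    Returns (coding_snps, noncoding_snps).
    """
    coding = noncoding = 0
    for pos, col in enumerate(zip(*seqs)):
        # Assumption: only unambiguous bases count; gaps do not make a site variable.
        bases = {base for base in col if base in "ACGT"}
        if len(bases) < 2:
            continue
        if any(start <= pos < end for start, end in cds_intervals):
            coding += 1
        else:
            noncoding += 1
    return coding, noncoding


def percent(part, total):
    """Percentage of total, rounded to one decimal place."""
    return round(100 * part / total, 1)


# Toy alignment with one hypothetical CDS covering columns 0-2 (illustrative only).
toy_counts = count_snps(["ACGTAC", "ACGTAT", "AGGTAC"], [(0, 3)])

# The proportions reported in the text follow from the stated counts.
coding_pct = percent(634, 1580)
noncoding_pct = percent(946, 1580)
```

The last two lines reproduce the 40.1% and 59.9% shares from the counts of 634 and 946 SNPs out of 1580.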
These divergent hotspot regions could be used as markers for phylogenetic characterization of Symplocos species, with more divergence observed in the non-coding regions than in the coding regions. Comparison of the whole CP genomes revealed differences between the seven species in the following intergenic regions: trnH-GUG~psbA, psbI~trnS-GCU, rpoC1~rpoB, rpl36~infA, psbL~psbF, and ccsA~ndhD. In addition, five highly variable genes were identified based on a markedly higher Pi value (> 0.008): rbcL, ycf4, psaJ, rpl22, and ycf1 (Supplementary Data File S3). Identification of species-level differences is essential to the ongoing conservation of vulnerable members of the Symplocos genus.
Figure 6. Sliding window analysis of the whole chloroplast genome nucleotide diversity (Pi) among seven Symplocos species.

Phylogenetic Analysis
CP sequences are increasingly used to construct plant phylogenies. Phylogenetic analysis was performed using the maximum likelihood (ML) method with 80 genes from 11 genomes, including the four newly sequenced Symplocos CP genomes and three previously reported Symplocos CP genomes (Figure 8). The resulting phylogeny strongly supports the monophyly of the Symplocaceae clade (bootstrap support, BS = 100). S. coreana is most closely related to S. ovatilobata, and together they form the earliest-branching lineage of Symplocaceae, with high bootstrap support (BS = 100). S. chinensis for. pilosa is most closely related to S. tanakana (BS = 100), and S. prunifolia is most closely related to S. paniculata (BS = 100).

Discussion
Some Symplocos species have long been used for medicinal purposes and dyes, especially S. racemosa Roxb., which is an important traditional Indian drug used to treat liver and uterine disorders and leucorrhea [38], and S. tanakana, which has been used as a mordant in South Korea [39]. These species may serve as new bio-industrial materials in the future, and the genome comparison presented here is intended to provide basic genetic data for that purpose. In the present study, we report the CP genome structure of four Symplocos species from South Korea and present a novel comparative study of Symplocos CP genomes. The CP genomes of the Symplocos species reported here are well conserved, with the same number of genes, gene order, and genome structure, including the typical quadripartite structure. These findings are consistent with those for CP genomes of other Symplocaceae species [40]. The genome lengths ranged from 156,961 bp (S. chinensis for. pilosa) to 157,365 bp (S. coreana) (Figure 1).
SSRs are often used as molecular (genetic) markers in conservation biology, population genetics, polymorphism investigations, and evolutionary biology because of their codominant properties and high reproducibility of analysis [41][42][43]. We identified 369 SSRs in seven Symplocos CP genomes, most of which were located in the intergenic regions. The Symplocos CP genome has a high A/T content; accordingly, most of the detected mononucleotide repeats were composed of A/T. These SSRs can be used as molecular markers for genetic diversity analysis and genetic evolution studies [44].
The IR region of the CP genome provides stability under various stress conditions [45]. However, the IR region has contracted and expanded in different species during CP evolution [46]. Recently, an expansion of the IR region in Clematis was confirmed [47]. In Symplocos, the IR is very stable, with the only differences found in the length of ψycf1 [10,48-50]. In contrast, the CP genomes of some Ericales families differ significantly from others, indicating rearrangements and changes in the IR region length during evolution. In particular, Rhododendron, Vaccinium, and Arbutus showed extreme shortening of the SSC region due to IR expansion [51][52][53].
Molecular markers with high sequence variation are useful for species identification and phylogenetic research in land plants [49,54]. To date, there have been no studies on the identification of species-level molecular markers for Symplocos. Many phylogenetic studies of seed plants have used the CP genome for species-level identification [52,[55][56][57]. Wang et al. [14] studied the phylogeny of Symplocos by sampling about 111 species and using the nuclear ribosomal internal transcribed spacer (nr-ITS) and three chloroplast markers (rpl16, matK, and trnL-trnF regions). Of the four traditionally recognized subgenera, the subgenus Hopea, distributed in East Asia, is monophyletic and sister to a group comprising all other Symplocos species, but the phylogenetic relationship of some taxa was not clear. Furthermore, Fritsch et al. sampled 74 species and their results were consistent with Wang's findings [18]. Soejima and Nagamasu sampled 30 species distributed in Japan and conducted studies based on nr-ITS, trnL-trnF, and trnH-psbA regions, suggesting that the section Palura, a deciduous group in the subgenus Hopea, has an independent status [19]. The CP regions used in those earlier studies showed relatively low nucleotide diversity in our data (rpl16: 0.00303, matK: 0.0074, trnL-trnF: 0.00234). Moreover, 17 genes including psbI, psbL, ndhB, and ndhE were identical in the seven Symplocos species analyzed (Pi = 0, Supplementary Data File S3) and are therefore not suitable for Symplocos molecular studies. In our study, high sequence divergence was detected in the following regions: trnH-GUG~psbA, psbI~trnS-GCU, rpoC1~rpoB, rpl36~infA, psbL~psbF, ccsA~ndhD, rbcL, ycf4, psaJ, rpl22, and ycf1 (Figure 6 and Supplementary Data File S3). The regions with high nucleotide diversity identified in the current study can be used in molecular studies (e.g., to confirm the molecular phylogeny of Symplocos).
In addition, the phylogenetic analysis and identification of divergence hotspots in the current study provide fundamental data for understanding the relationships among Symplocaceae species. Further studies involving extensive sampling may be needed to better understand the detailed phylogenetic relationships among Symplocaceae and their evolutionary history.

Conclusions
In the present study, we sequenced the complete CP genomes of four Symplocos species: S. chinensis for. pilosa, S. coreana, S. prunifolia, and S. tanakana. We demonstrated that Symplocos species are separated into four phylogenetic groups: (1) the S. coreana-S. ovatilobata group; (2) the S. costaricana group; (3) the S. chinensis for. pilosa-S. tanakana group; and (4) the S. prunifolia-S. paniculata group. Additionally, important genetic information including that on SNPs, SSRs, long repeats, divergent hotspot regions, and phylogeny was obtained. Technological advances in plant science have made the CP genome an important tool for plant research. The complete CP genomic data of Symplocos will provide useful information for studying genetic diversity and species identification, and the current study could be used for phylogenetic studies of Symplocos and whole-CP genome comparisons.