Complete Chloroplast Genome of Clethra fargesii Franch., an Original Sympetalous Plant from Central China: Comparative Analysis, Adaptive Evolution, and Phylogenetic Relationships

: Clethra fargesii , an essential ecological and endemic woody plant of the genus Clethra in Clethraceae, is widely distributed in Central China. So far, there have been a paucity of studies on its chloroplast genome. In the present study, we sequenced and assembled the complete chloroplast genome of C. fargesii . We also analyzed the chloroplast genome features and compared them to Clethra delavayi and other closely related species in Ericales. The complete chloroplast genome is 157,486 bp in length, including a large single-copy (LSC) region of 87,034 bp and a small single-copy (SSC) region of 18,492 bp, separated by a pair of inverted repeat (IR) regions of 25,980 bp. The GC content of the whole genome is 37.3%, while those in LSC, SSC, and IR regions are 35.4%, 30.7%, and 43.0%, respectively. The chloroplast genome of C. fargesii encodes 132 genes in total, including 87 protein-coding genes (PCGs), 37 tRNA genes, and eight rRNA genes. A total of 26,407 codons and 73 SSRs were identiﬁed in C. fargesii chloroplast genome. Additionally, we postulated and demonstrated that the structure of the chloroplast genome in Clethra species may present evolutionary conservation based on the comparative analysis of genome features and genome alignment among eight Ericales species. The low Pi values revealed evolutionary conservation based on the nucleotide diversity analysis of chloroplast genome in two Clethra species. The low selection pressure was shown by a few positively selected genes by adaptive evolution analysis using 80 coding sequences (CDSs) of the chloroplast genomes of two Clethra species. The phylogenetic tree showed that Clethraceae and Ericaceae are sister clades, which reconﬁrm the previous hypothesis that Clethra is highly conserved in the chloroplast genome using 75 CDSs of chloroplast genome among 40 species. The genome information and analysis results presented in this study are valuable for further study on the intraspecies identiﬁcation, biogeographic analysis, and phylogenetic relationship in Clethraceae.


Introduction
Clethra Gronov. ex L. is an important genus in the family Clethraceae composed of deciduous or evergreen shrubs or trees [1]. The genus comprises more than 70 species worldwide and naturally distributes in the mountainous and hilly areas ranging from tropical to subtropical areas in Eastern and Southeast Asia, Southeastern of North America, Central America, Eastern Brazil, and Madeira Islands [1,2]. The fragrant flowers, beautiful tree crowns, and white inflorescences make it a good ornamental prospect in gardens [3]. It can be also used as ecological and environmental protection plants to restore the destroyed used to find inverted repeats (IR) regions and annotate the whole chloroplast genomes using Amborella trichopoda, C. delavayi, and some other related species as the reference [26]. The software Geneious-v10.2.3 [27] was used to check the annotated genes and detected errors corrected manually. Additionally, the annotated chloroplast genome sequence was submitted to GenBank (GenBank No.: MT742578) on the NCBI website. The circular chloroplast genomic map of C. fargesii was drawn and visualized using Chloroplot online software (https://irscope.shinyapps.io/Chloroplot/ (accessed on 1 April 2021)) [28].

Comparative Analysis of the Chloroplast Genome
The chloroplast genome features of C. fargesii were analyzed using the Geneious-v10.2.3 software by comparing seven other available chloroplast genomes in Ericales downloaded from the NCBI website (Table S1). The seven Ericale species were selected to verify the conservation of C. fargesii chloroplast genome based on the related gradient close to C. fargesii in phylogeny. Further, multiple genome alignment analysis was done using the Mauve program [29]. Excluding two Ericaceae species, six Ericales species chloroplast genomes were used to analyze junction characteristics and codon usage bias considering the cases of great gene losses and extreme expansion or contraction of IR region in the chloroplast genomes of Ericales species. The figure of junction characteristics was drawn by AI software and MEGA7.0 was used to analyze the codon usage bias with the RSCU (relative synonymous codon usage) values based on the CDS region [30].

SSRs and Nucleotide Diversity Analysis
The simple sequence repeats (SSRs) of the chloroplast genomes of C. fargesii and C. delavayi were analyzed using the software MISA [31,32] with the basic repeat setting: 10 for mononucleotides, 4 for dinucleotides, 4 for trinucleotides, 3 for tetranucleotides, 3 for pentanucleotides, and 3 for hexanucleotide repeats. The DnaSP-v5.10 software was used to calculate nucleotide variability (Pi) values and variable sites using the aligned chloroplast genome sequences of C. fargesii and C. delavayi with a window length of 600 bp and a step size of 200 bp [33].

Adaptive Evolution Analysis
The positive selection sites were evaluated in 80 coding sequences (CDSs) of the chloroplast genomes of C. fargesii and C. delavayi using the PAML-v4.7 [34] package implemented in EasyCodeML software [35]. The ratio of nonsynonymous (dN) and synonymous substitution (dS) (ω = dN/dS) was calculated using site model based on the four sitespecific models (M0 vs M3, M1a vs. M2a, M7 vs. M8, and M8a vs. M8) with likelihood ratio test (LRT) threshold of p < 0.05 elucidating adaptation signatures within the genome. Here, each CDS sequence was aligned using codons in the MAFFT-v7.409 program [36] and then concatenated using the program Concatenate Sequence. Comparing the four site-specific models, M7 vs. M8 (positive selection) was calculated to identify positive selection sites based on both ω and LRTs values. The methods of Bayes Empirical Bayes (BEB) analysis [37] and Naive Empirical Bayes (NEB) analysis were implemented in the M8 model to detect positive selection sites of the selected genes.

Phylogenetic Analysis
To estimate the phylogenetic position of Clethra within the Ericales, 80 CDSs of the chloroplast genomes of 40 species was used to construct phylogenetic tree by combining maximum likelihood (ML) with Bayesian (BI), including 12 families in Ericales and two outgroups (Cornus chinensis Wangerin and Hydrangea davidii Franch.) (Table S1). Each CDS sequence was aligned using the software MAFFT-v7.409 [36], and then, we removed the stop codons and discarded poor fragments in the Gbloks program [38] and concatenated in the program Concatenate Sequence. On one hand, the aligned sequences were used to select the best-fit model for the BI methods according to the Bayesian information criterion (BIC) in the ModelFinder program [39]. GTR+I+G+F was recommended to be the best-fit model and then the BI tree (1,000,000 generations, sampling every 1000 generations) was constructed by the software MrBayes-3.2.6 [40], in which the initial 25% of sampled data were discarded as burn-in. On the other hand, the best fit model GTR+F+I+G4 was selected for maximum likelihood (ML) analysis by ModelFinder program according to BIC. The software IQ-TREE was implemented to construct the ML tree with 1000 bootstrap replications [41][42][43], which is performed in PhyloSuite-v1.2.1 [44]. The ML tree and Bayesian tree were manually recombined using AI software based on the consistent topological structures. The constructed phylogenetic tree was visualized using the software Figtree-v1.4.4 (https://www.figtreeasia.com/, (accessed on 1 April 2021)).

Chloroplast Genome Features
The complete chloroplast genome of C. fargesii is circular DNA of 157,486 bp in length with a quadripartite structure typical of most angiosperms ( Figure 1). It comprises a large single-copy (LSC) region of 87,034 bp, a small single-copy (SSC) region of 18,492 bp, and a pair of inverted repeat (IRa and IRb) regions of 25,980 bp each, which separate the LSC and SSC regions ( Figure 1 and Table 1). The GC content of the whole genome is 37.3%, while the GC contents in LSC, SSC, and IR regions are 35.4%, 30.7%, and 43.0%, respectively ( Table 1). The higher GC content in IR regions may be due to the presence of tRNA and rRNA genes that dominate the majority of the regions and have a relatively higher GC content. The GC content is an important indicator to distinguish different species groups [45,46] and higher GC content contributes to maintaining the stability of the DNA strands [47]. According to Liu et al. and Kim et al., the GC content in the IR regions of two Ericaceae species (Rhododendron griersonianum Balf. f. & Forrest and Vaccinium oldhamii Miquel) is significantly lower than that of the other six species (Table 1) [48,49], which may lead to more variations in length and gene number in their chloroplast genomes than other species in Ericales. Table 1. Features of the complete chloroplast genomes of C. fargesii and seven related species in Ericales order.

Category
Groups of Genes Name of Genes The chloroplast genome of C. fargesii is most similar to C. delavayi and highly resemblant to Styrax japonicus Sieb. et Zucc. and Schima superba Gardn. et Champ. in genome size, GC content, number of genes, and introns, in general, by comparing the structure features among eight Ericales species (Table 1 and Table S3). Although the Ericaceae (R. griersonianum and V. oldhamii) and the Actinidiaceae (Saurauia tristyla DC. and Actinidia eriantha Benth.) are two adjacent clades of the Clethraceae clade in the phylogenetic relationship while the other two species are in the farther clades [55], their chloroplast genome features displayed more differences in the expansion and contraction of whole chloroplast genomes and four regions. The contraction of IR regions leads to the expansion of SSC regions for two Actinidiaceae species, while the extreme expansion of IR regions reduces the length of the SSC regions to about 2000-3000 bp for two Ericaceae species (Table 1). Besides, the clpP gene is a typical phenomenon of gene loss in the structural variation of the chloroplast genome [56,57], which has been lost in the two Ericaceae species and two Actinidiaceae species while still present in S. japonicus and S. superba. Therefore, Clethra species appear to have conserved chloroplast genome structures based on the above analysis.

Comparative Chloroplast Genome Structure
The multiple genome alignments revealed significant structural differences for eight Ericales species using the Mauve program. Two Ericaceae species are longer than the other six species in the genome length due to the expansion of IR regions ( Figure 2). Some inversion and rearrangement regions were observed in the LSC region of two Ericaceae species, while other regions showed a lower variation ratio in gene structure. The two Ericaceae species may be prone to the structural mutations in DNA strands and accumulate constantly due to the low GC content in their chloroplast genomes under the long evolutionary history. However, the two Clethra species and the other four Ericales species are more conserved, which leads to a growing divergence in the genome structure of them from that of the two Ericaceae species. Except for two Ericaceae species, whole chloroplast genome from that of the two Ericaceae species. Except for two Ericaceae species, whole chloroplast genome alignment revealed relatively low variation and high similarity between two Clethra species and the other four Ericales species.

Junction Characteristics
The size of chloroplast genomes varies depending on the size variation of the LSC and IR regions [58], especially in the expansion and contraction of two IR regions [59,60]. Detailed comparison of the IR/SSC and IR/LSC junction regions among six species are presented in Figure 3. The junction characteristics of two Clethra species are also highly similar to S. japonicus and S. superba. Among the four species, the rps19 gene is located within the IRb/LSC boundary and expands to the IRb region with 32, 44, 46, and 67 bp, and the rpl2 gene is duplicated in the IR region close to two IR/LSC boundaries. Although the case in two Actinidiaceae species is different from the above four species, the trnH-GUG gene is duplicated in the IR region close to two IR/LSC boundaries, while in the other four species, the trnH-GUG gene is completely located to the right of the IRa/LSC boundary. Moreover, in two Actinidiaceae species, the psbA gene is located within the IRa/LSC boundary with 238 and 60 bp in IRa region, and the left of the IRb/LSC boundary is the rpl23 gene. Additionally, the ycf1 gene occupies the IRa/LSC boundary, with a length ranging from 404 to 1078 bp in IRa region, a corresponding ycf1 pseudogene in IRb boundary among the six species. The ndhF gene among five species is entirely located to the right of the IRb/SSC boundary except that 24 bp expands to the IRb regions in S. japonicus. Based on the above results, the two Clethra species have more resemblances to S. japonicus and S. superba in the expansions and contractions of IR region boundaries than two Actinidiaceae species.

Junction Characteristics
The size of chloroplast genomes varies depending on the size variation of the LSC and IR regions [58], especially in the expansion and contraction of two IR regions [59,60]. Detailed comparison of the IR/SSC and IR/LSC junction regions among six species are presented in Figure 3. The junction characteristics of two Clethra species are also highly similar to S. japonicus and S. superba. Among the four species, the rps19 gene is located within the IRb/LSC boundary and expands to the IRb region with 32, 44, 46, and 67 bp, and the rpl2 gene is duplicated in the IR region close to two IR/LSC boundaries. Although the case in two Actinidiaceae species is different from the above four species, the trnH-GUG gene is duplicated in the IR region close to two IR/LSC boundaries, while in the other four species, the trnH-GUG gene is completely located to the right of the IRa/LSC boundary. Moreover, in two Actinidiaceae species, the psbA gene is located within the IRa/LSC boundary with 238 and 60 bp in IRa region, and the left of the IRb/LSC boundary is the rpl23 gene. Additionally, the ycf1 gene occupies the IRa/LSC boundary, with a length ranging from 404 to 1078 bp in IRa region, a corresponding ycf1 pseudogene in IRb boundary among the six species. The ndhF gene among five species is entirely located to the right of the IRb/SSC boundary except that 24 bp expands to the IRb regions in S. japonicus. Based on the above results, the two Clethra species have more resemblances to S. japonicus and S. superba in the expansions and contractions of IR region boundaries than two Actinidiaceae species.

Codon Usage Analysis
The relative synonymous codon usage (RSCU) was analyzed using the protein-coding regions of C. fargesii chloroplast genome and a total of 26,407 codons were identified. UUA (1.86%, leucine) is the most frequently used codon, while AGC (0.35%, Serine) was the least abundant. Serine (6.01%), arginine (6.00%), and leucine (6.00%) are the three most abundant amino acids, whereas methionine (1.00%) and tryptophan (1.00%) are the two least abundant amino acids. As the most abundant amino acid, a preference for leucine was also observed in other reported angiosperm chloroplast genomes [54,61], which have been frequently reported in other angiosperm taxa [62,63]. Interestingly, the majority of the RSCU values with A/T-ending codons are more than one (RSCU > 1.00), while the majority of the RSCU values with C/G-ending codons were less than one (RSCU < 1.00). Moreover, most of the amino acids had no less than two synonymous codons, except that methionine (AUG) and tryptophan (UGG) have no codon usage bias due to only one encoded codon (Table S4 and Figure 4).
The total of codons varied from 23,024 (A. eriantha) to 26,595 (S. superba) according to the comparative analysis of the RSCU among six species in Ericales. Among these codons, serine (5.99%-6.02%), arginine (6.00%-6.02%), and leucine (6.00%-6.01%) were also maximum value the same as C. fargesii, while codons for methionine (1.00%) and tryptophan (1.00%) were also minimum ( Table S4). The comparison of the RSCU value of the six species was almost no different ( Figure 5), which indicated that both two Clethra species and the other four species belonging to the same order have relative stability in codon usage bias.

Codon Usage Analysis
The relative synonymous codon usage (RSCU) was analyzed using the protein-coding regions of C. fargesii chloroplast genome and a total of 26,407 codons were identified. UUA (1.86%, leucine) is the most frequently used codon, while AGC (0.35%, Serine) was the least abundant. Serine (6.01%), arginine (6.00%), and leucine (6.00%) are the three most abundant amino acids, whereas methionine (1.00%) and tryptophan (1.00%) are the two least abundant amino acids. As the most abundant amino acid, a preference for leucine was also observed in other reported angiosperm chloroplast genomes [54,61], which have been frequently reported in other angiosperm taxa [62,63]. Interestingly, the majority of the RSCU values with A/T-ending codons are more than one (RSCU > 1.00), while the majority of the RSCU values with C/G-ending codons were less than one (RSCU < 1.00). Moreover, most of the amino acids had no less than two synonymous codons, except that methionine (AUG) and tryptophan (UGG) have no codon usage bias due to only one encoded codon (Table S4 and Figure 4).
The total of codons varied from 23,024 (A. eriantha) to 26,595 (S. superba) according to the comparative analysis of the RSCU among six species in Ericales. Among these codons, serine (5.99-6.02%), arginine (6.00-6.02%), and leucine (6.00-6.01%) were also maximum value the same as C. fargesii, while codons for methionine (1.00%) and tryptophan (1.00%) were also minimum ( Table S4). The comparison of the RSCU value of the six species was almost no different ( Figure 5), which indicated that both two Clethra species and the other four species belonging to the same order have relative stability in codon usage bias.

Analysis of Simple Sequence Repeats (SSRs) and Nucleotide Diversity
Simple sequence repeat (SSR), also known as microsatellite as an excellent molecular markers tool with highly reproductive, polymorphic, and reliable advantages, is widely used in plant species identification, phylogeny, and population genetic studies [64]. In this study, SSRs were done using MISA software to identify the potential repeats based on the chloroplast genome sequences of two Clethra species. A total of 73 SSRs were detected in the chloroplast genome of C. fargesii, with the sequence length of 8-19 bp and 3-14 repeats,

Analysis of Simple Sequence Repeats (SSRs) and Nucleotide Diversity
Simple sequence repeat (SSR), also known as microsatellite as an excellent molecular markers tool with highly reproductive, polymorphic, and reliable advantages, is widely used in plant species identification, phylogeny, and population genetic studies [64]. In this study, SSRs were done using MISA software to identify the potential repeats based on the chloroplast genome sequences of two Clethra species. A total of 73 SSRs were detected in the chloroplast genome of C. fargesii, with the sequence length of 8-19 bp and 3-14 repeats, including 18 mononucleotides (24.7%), 45 dinucleotides (61.6%), 3 trinucleotides (4.1%), 5 tetranucleotides (6.8%), 1 pentanucleotide (1.4%), and 1 hexanucleotide (1.4%). Contrastively, in the C. delavayi chloroplast genome, a total of 78 SSRs were detected with the sequence length of 8-19 bp and 3-19 repeats, of which mononucleotides are 23 (29.5%), dinucleotides are 45 (57.7%), trinucleotides are 3 (3.8%), tetranucleotides are 5 (6.4%), pentanucleotide is 1 (1.3%), and hexanucleotide is 1 (1.3%) (Figures 6A and 7). including 18 mononucleotides (24.7%), 45 dinucleotides (61.6%), 3 trinucleotides (4.1%), 5 tetranucleotides (6.8%), 1 pentanucleotide (1.4%), and 1 hexanucleotide (1.4%). Contrastively, in the C. delavayi chloroplast genome, a total of 78 SSRs were detected with the sequence length of 8-19 bp and 3-19 repeats, of which mononucleotides are 23 (29.5%), dinucleotides are 45 (57.7%), trinucleotides are 3 (3.8%), tetranucleotides are 5 (6.4%), pentanucleotide is 1 (1.3%), and hexanucleotide is 1 (1.3%) (Figures 6A and 7). Most of the SSRs types are mononucleotide and dinucleotide repeats, while the other complex SSRs are at lower frequencies in the two-chloroplast genome sequences, and these results conform with previous studies [65][66][67]. It is a particular case that the number of dinucleotide SSRs is significantly greater than mononucleotide SSRs and A/T (no C/G) is the only mononucleotide SSRs type in the two species ( Figures 6A and 7), which are rarely found in other species [68][69][70][71][72]. Of the four copy regions of two species chloroplast genomes, the SSRs mainly in the LSC regions (64.4% and 67.9%) and the proportion close to two-thirds of the total are largely beyond the corresponding proportion of LSC length occupying the whole chloroplast genomes ( Figure 6B). In addition, the SSRs proportion in the CDS region (36.5% and 32.1%) is far less than that in non-coding regions ( Figure  6C) and does not match with the proportion of CDS length occupying the whole chloroplast genomes. The SSRs were disproportionately found in the LSC and non-coding regions, which may have higher molecular marker potential for Clethra species. The identified SSRs will be useful in studying phylogeny and population genetics of the Clethra genus in the future. It is also revealed that the chloroplast genome sequences of the two Clethra species have high similarity based on SSR analysis.   Most of the SSRs types are mononucleotide and dinucleotide repeats, while the other complex SSRs are at lower frequencies in the two-chloroplast genome sequences, and these results conform with previous studies [65][66][67]. It is a particular case that the number of dinucleotide SSRs is significantly greater than mononucleotide SSRs and A/T (no C/G) is the only mononucleotide SSRs type in the two species ( Figures 6A and 7), which are rarely found in other species [68][69][70][71][72]. Of the four copy regions of two species chloroplast genomes, the SSRs mainly in the LSC regions (64.4% and 67.9%) and the proportion close to two-thirds of the total are largely beyond the corresponding proportion of LSC length occupying the whole chloroplast genomes ( Figure 6B). In addition, the SSRs proportion in the CDS region (36.5% and 32.1%) is far less than that in non-coding regions ( Figure 6C) and does not match with the proportion of CDS length occupying the whole chloroplast genomes. The SSRs were disproportionately found in the LSC and non-coding regions, which may have higher molecular marker potential for Clethra species. The identified SSRs will be useful in studying phylogeny and population genetics of the Clethra genus in the future. It is also revealed that the chloroplast genome sequences of the two Clethra species have high similarity based on SSR analysis.
The Pi value of nucleotide diversity was calculated to analyze the sequence divergence based on the chloroplast genome of C. fargesii and C. delavayi. The values of the entire genome sequence range from 0 to 0.01000 and the average value is 0.00110. The analysis displayed that the highest level of mean nucleotide diversity is 0.00225 (range: 0-0.00667) in the SSC regions, while the lowest level of mean nucleotide diversity is 0.00333 (range: 0-0.00024) in the IR regions. Comparing the four regions, the highest Pi value is found in the LSC region by having a comparison among four regions. These values of the LSC region vary from 0 to 0.01000, with a mean value of 0.00142. Furthermore, seven high divergent regions (Pi > 0.00600, the mean value = 0.00787) were detected, of which four are located in non-coding regions (trnH/psbA, psbE/petL, rps16/rpl36, and rps15/ycf1) while three are in CDS regions (atpH, rpl16, and rps3/rpl22), and the highest Pi value is in rpl16 region. Among the seven divergent regions, six existed in LSC regions and only rps15/ycf1 was in SSC regions ( Figure 8). The Pi value of nucleotide diversity was calculated to analyze the sequence divergence based on the chloroplast genome of C. fargesii and C. delavayi. The values of the entire genome sequence range from 0 to 0.01000 and the average value is 0.00110. The analysis displayed that the highest level of mean nucleotide diversity is 0.00225 (range: 0-0.00667) in the SSC regions, while the lowest level of mean nucleotide diversity is 0.00333 (range: 0-0.00024) in the IR regions. Comparing the four regions, the highest Pi value is found in the LSC region by having a comparison among four regions. These values of the LSC region vary from 0 to 0.01000, with a mean value of 0.00142. Furthermore, seven high divergent regions (Pi > 0.00600, the mean value = 0.00787) were detected, of which four are located in non-coding regions (trnH/psbA, psbE/petL, rps16/rpl36, and rps15/ycf1) while three are in CDS regions (atpH, rpl16, and rps3/rpl22), and the highest Pi value is in rpl16 region. Among the seven divergent regions, six existed in LSC regions and only rps15/ycf1 was in SSC regions ( Figure 8).  The nucleotide diversity of two Clethra species shows a lower Pi value compared to other previous analyzed genera [73,74] and indicates evolutionary conservation in the chloroplast genome. The IR regions have lower nucleotide diversity than the LSC and SSC regions similar to other reported results [75][76][77][78] owing to their stability and consistency [79]. Moreover, combining the above results, it is evident that the LSC region is less conserved and has high nucleotide diversity than the other three regions, especially the IR regions. The analysis of nucleotide diversity also deduces that the non-coding regions have higher development and utilization values. More significantly, all these regions could be potential molecular markers for DNA barcodes used in species identification in this genus and future phylogenetic analysis studies of the family Clethraceae.

Adaptive Evolution Analysis
The ratio of non-synonymous (dN) to synonymous substitutions (dS), dN/dS, has been widely used to evaluate the natural selection pressure and evolution rates of nucleotides in genes [80]. The dN/dS ratio > 1 specifies positive selection (adaptive evolution) while dN/dS ratio < 1 signifies negative selection (purifying evolution). Bayes Empirical Bayes (BEB) and naive empirical Bayes (NEB) analysis were executed to detect positively selected sites using the M8 model based on 80 CDSs of two Clethra species. In the BEB method, only 38 positively selected sites were detected, of which 37 sites represent the ycf1 gene and a site is in the rps4 gene. A total of 107 positively selected sites were detected in the NEB method, which were distributed in ycf1, ycf2, cemA, ndhB, and psbD genes (Table 3). Generally, a few genes have detected as positively selected sites by the analysis either using BEB method or NEB method for analysis, which means that the two Clethra species display less selection pressure and conserved evolution history.

Phylogenetic Analysis
The chloroplast genome sequences have been widely used to study the evolution and phylogenetic relationships in plants owing to the advance of sequencing technologies [81]. The phylogenetic position of Clethra in Ericales was shown using the 75 chloroplast CDSs of 40 species by combining the ML tree with the BI tree ( Figure 9). Both the posterior probabilities values of the BI tree and the bootstrap values of the ML tree are high displayed on the nodes generally. Also, the Clethraceae clade and Ericaceae clade form a sister group. The topological structure is highly consistent with the phylogenetic relationship of Ericales in the APG IV system [55]. In previous morphological classifications, the genus Clethra was removed from the Ericaceae and promoted to a new family Clethraceae closely related with Ericaceae since 1851 [8]. In recent years, molecular systematics study has revealed that Clethraceae has a sister relationship with Cyrillaceae and Ericaceae using molecular data, whether by partial DNA fragments [9,82] or chloroplast genome to construct phylogenetic trees [83]. The stable taxonomic position in Ericales firmly reconfirms the previous deduction of a high degree of the conserved chloroplast genome of Clethra.
phylogenetic trees [83]. The stable taxonomic position in Ericales firmly reconfirms the previous deduction of a high degree of the conserved chloroplast genome of Clethra.

Conclusions
In this study, we reported the complete chloroplast genome of C. fargesii and comparative analysis for the genome features of Clethra species with closely related species in Ericales for the first time. The number of annotated unique genes was 112, including 80 PCGs, 30 tRNAs, and 4 rRNAs. A total of 73 SSRs and 7 high Pi value sites of nucleotide diversity were identified, which could be useful as potential and exploitable molecular markers. Additionally, we accepted the hypothesis that the chloroplast genomes of Clethra species present a lower variation ratio and conserved evolution history based on the comparative analysis of genome features, genome alignment, junction characteristics, codon usage bias, SSRs, nucleotide diversity, and selection pressure. The phylogenetic analysis revealed that Clethraceae and Ericaceae are sister groups. The stable taxonomic position reconfirms the previous deduction that Clethra species have a highly conserved chloroplast genome, which means that C. fargesii could be used as a reference for annotating the chloroplast genome of other Ericales species. The genome information and above analysis in our study will provide a theoretical basis for future studies of Clethraceae in phylogeny, taxonomy, and biogeography.

Conclusions
In this study, we reported the complete chloroplast genome of C. fargesii and comparative analysis for the genome features of Clethra species with closely related species in Ericales for the first time. The number of annotated unique genes was 112, including 80 PCGs, 30 tRNAs, and 4 rRNAs. A total of 73 SSRs and 7 high Pi value sites of nucleotide diversity were identified, which could be useful as potential and exploitable molecular markers. Additionally, we accepted the hypothesis that the chloroplast genomes of Clethra species present a lower variation ratio and conserved evolution history based on the comparative analysis of genome features, genome alignment, junction characteristics, codon usage bias, SSRs, nucleotide diversity, and selection pressure. The phylogenetic analysis revealed that Clethraceae and Ericaceae are sister groups. The stable taxonomic position reconfirms the previous deduction that Clethra species have a highly conserved chloroplast genome, which means that C. fargesii could be used as a reference for annotating the chloroplast genome of other Ericales species. The genome information and above analysis in our study will provide a theoretical basis for future studies of Clethraceae in phylogeny, taxonomy, and biogeography.

Supplementary Materials:
The following are available online at https://www.mdpi.com/article/ 10.3390/f12040441/s1, Table S1: NCBI accession number of 40 chloroplast genomes, Table S2: The gene number of the chloroplast genome of C. fargesii in four regions, Table S3: Comparison of the kinds and length of introns among eight chloroplast genome sequences, Table S4: Comparison of the relative synonymous codon usage (RSCU) among six chloroplast genomes.