Complete Chloroplast Genome Sequence and Phylogenetic Analysis of Quercus acutissima

Quercus acutissima, an endemic and ecologically important species of the genus Quercus, is widely distributed throughout China. However, there have been few studies on its chloroplast genome. In this study, the complete chloroplast (cp) genome of Q. acutissima was sequenced, analyzed, and compared to those of five other species in the Fagaceae family. The Q. acutissima chloroplast genome is 161,124 bp in size, including one large single copy (LSC) region of 90,423 bp and one small single copy (SSC) region of 19,069 bp, separated by two inverted repeat (IR) regions totaling 51,632 bp. The GC content of the whole genome is 36.08%, while those of the LSC, SSC, and IR regions are 34.62%, 30.84%, and 42.78%, respectively. The Q. acutissima chloroplast genome encodes 136 genes, including 88 protein-coding genes, eight ribosomal RNA genes, and 40 transfer RNA genes. In the repeat structure analysis, 31 forward and 22 inverted long repeats and 65 simple-sequence repeat loci were detected in the Q. acutissima cp genome. The abundance of simple-sequence repeat loci in the genome suggests potential for future population genetic work. The genome comparison revealed that the LSC region is more divergent than the SSC and IR regions, and that divergence is higher in noncoding regions than in coding regions. The phylogenetic analysis of 25 species indicated that members of the genus Quercus do not form a clade and that Q. acutissima is closely related to Q. variabilis. This study identified the characteristics of the Q. acutissima cp genome, which will provide a theoretical basis for species identification and biological research.


Introduction
Oak trees provide humans with materials for food, clothing, and housing, while oak forests provide animals and other organisms with habitat, clean air, and sufficient, clean moisture. Oak trees are linked to Chinese culture, and are also often called eucalyptus or pecking trees. In China, eucalyptus is regarded as a mysterious tree, growing silently, watching its ancestors forge ahead, and passing from generation to generation. Many countries regard oaks as sacred trees, and consider them to be magical and a symbol of longevity, strength, and pride.
The genus Quercus L. (oak) contains more than 400 species that are widespread in the northern hemisphere [1]. These species play important roles in China's forest ecosystems. The taxonomy, genetic structure, and breeding of Quercus are complicated by its wide variety of species, diverse forms, complex habitat conditions, and gene exchange between species. Many studies have used nuclear simple sequence repeat (SSR) and chloroplast DNA markers to study phylogeny and population variation [2,3]. Previous studies found a conflict (inconsistency) between the phylogenies inferred from plastid data and from nuclear data in Senecioneae and Neotropical Catasetinae [4,5]. Therefore, it is not sufficient to study Quercus simply by using plastid regions. With the rapid development of next-generation sequencing, genome acquisition is now cheaper and faster than with traditional Sanger sequencing. Complete chloroplast (cp) genome data will be needed to infer the phylogenetic relationships of Quercus and Fagaceae in future studies.
The genus is characterized by high variability of morphological and ecological traits, the occurrence of mixed stands, large population sizes, and high levels of gene flow within the Quercus complex [6][7][8][9][10][11]. A new classification of Quercus L. was proposed by Denk with eight sections: Cyclobalanopsis, Cerris, Ilex, Lobatae, Quercus, Ponticae, Protobalanus, and Virentes [12]. In China, Quercus is divided into five morphology-based sections: Quercus, Aegilops, Heterobalanus, Engleriana, and Echinolepides [13][14][15]. Because of incomplete sampling, the use of markers with insufficient phylogenetic signal, and complex evolutionary histories, the relationships among Quercus species are not fully understood.
Q. acutissima is an ecologically and economically important tree species in deciduous broad-leaved forests in the temperate zone of East Asia, widely distributed on the Hu Huanyong line or in Southeast China (latitude from 18° to 41° N and longitude from 91° to 123° E) [16]. This line, from Heilongjiang Province to Tengchong, Yunnan Province, is roughly a straight line inclined at 45°. The development, origin, and reproduction of China are linked with Q. acutissima. Therefore, we need to protect, cultivate, and utilize Q. acutissima, and the species has received substantial attention in phylogeny and biogeography studies. Most previous studies have focused on its population structure [17], breeding [18], forest management [19], and physiology [20]. Studies on the genetic variation of Q. acutissima using simple sequence repeat (SSR) and cpDNA markers have been carried out in China and South Korea [16,21]. According to this research, the distribution of Q. acutissima often overlaps with that of other oak species, i.e., Q. variabilis and Q. chenii [22]. Several species are often found within one population, although this has usually been determined from a comparison of morphology rather than at the molecular level. Therefore, an analysis of the complete cp genome of Q. acutissima will help to identify the species further.
In the present study, we constructed the whole chloroplast genome of Q. acutissima by using next-generation sequencing and applying a combination of de novo and reference-guided assembly. Here, we describe the whole chloroplast genome sequence of Q. acutissima and the characterization of long repeats and simple sequence repeats (SSRs). We compare and analyze the chloroplast genome of Q. acutissima and the chloroplast genome of other members of Fagaceae. It is expected that the results will provide a theoretical basis for the determination of phylogenetic status and future scientific research.

Features of Q. acutissima cpDNA
A total of 63 million paired-end reads were produced, yielding 9.82 Gb of clean data. All reads were deposited in the NCBI Sequence Read Archive (SRA) under accession number MH607377. The complete cp genome is 161,124 bp in size (Figure 1). It displays a typical quadripartite structure, with a pair of IRs (25,816 bp each) separated by the large single copy (LSC; 90,423 bp) and small single copy (SSC; 19,069 bp) regions (Figure 1 and Table 1). The DNA G + C contents of the LSC, SSC, and IR regions and of the whole genome are 34.62, 30.84, 42.78, and 36.08 mol %, respectively, similar to those of the chloroplast genomes of other Quercus species (Figure A1; Table 2). The DNA G + C content is an important indicator of species affinity [23]. The G + C content of the IR regions is clearly higher than that of the LSC and SSC regions, a pattern that is very common in other plants [23,24]. GC skew has been shown to be an indicator of DNA leading strands, lagging strands, replication origins, and replication termini [25][26][27].
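The region sizes and composition statistics above can be checked with a few lines of code. The following is a minimal sketch, not the authors' pipeline: the function names `gc_content` and `gc_skew` are illustrative, and GC skew is assumed to follow the common definition (G − C)/(G + C).

```python
def gc_content(seq: str) -> float:
    """Return the G + C percentage of a DNA sequence (case-insensitive)."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def gc_skew(seq: str) -> float:
    """Return GC skew, assumed here to be (G - C) / (G + C)."""
    seq = seq.upper()
    g, c = seq.count("G"), seq.count("C")
    if g + c == 0:
        raise ValueError("sequence contains no G or C")
    return (g - c) / (g + c)


# Region lengths reported in the text (bp); IR is the length of one copy.
LSC, SSC, IR = 90_423, 19_069, 25_816

# Quadripartite structure: one LSC, one SSC, and two IR copies.
total_length = LSC + SSC + 2 * IR
```

Applied to the reported region lengths, `total_length` reproduces the stated genome size of 161,124 bp. In practice `gc_content` would be applied to each region's sequence extracted from the assembled genome.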
Plant chloroplast genomes may have 63 to 209 genes, but most have between 110 and 130, with a highly conserved composition and arrangement, including photosynthetic genes, chloroplast transcriptional expression-related genes, and some other protein-coding genes [28]. In the Q. acutissima chloroplast genome, 136 functional genes were predicted and divided into six groups, including eight rRNA genes, 40 tRNA genes, and 88 protein-coding genes (Tables 1 and 3). In addition, 14 tRNA genes, eight rRNA genes, and 15 protein-coding genes are duplicated in the IR regions (Figure 1). The LSC region includes 62 protein-coding and 25 tRNA genes, while the SSC region includes 13 protein-coding genes (Table A1).
Based on the protein-coding sequences and tRNA genes, the frequency of codon usage was estimated for the Q. acutissima cp genome and is summarized in Table A2. In total, all genes are encoded by 26,311 codons. Among these, leucine, with 2824 (10.7%) codons, is the most frequent amino acid in the cp genome, and cysteine, with 293 (1.1%), is the least frequent (Table 3). A- and U-ending codons are common. The most preferred synonymous codons (relative synonymous codon usage (RSCU) values > 1) end with A or U [23,29].
Table 3. List of genes annotated in the cp genome of Q. acutissima sequenced in this study.
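To make the RSCU statistic concrete, the sketch below computes it from a list of coding sequences. It is an illustration under stated assumptions rather than the software used in this study: the function name `rscu` is hypothetical, the standard genetic code is assumed, and RSCU is taken as the observed codon count divided by the mean count over all synonymous codons of the same amino acid.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G (assumption:
# no alternative translation table is needed for this illustration).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def rscu(cds_sequences):
    """Return {codon: RSCU} for codons of amino acids observed in the input."""
    counts = Counter()
    for cds in cds_sequences:
        cds = cds.upper().replace("U", "T")
        # Read complete in-frame codons only; skip codons with ambiguous bases.
        for i in range(0, len(cds) - len(cds) % 3, 3):
            codon = cds[i:i + 3]
            if codon in CODON_TABLE:
                counts[codon] += 1

    # Group synonymous codons by the amino acid they encode.
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        families[aa].append(codon)

    values = {}
    for codons in families.values():
        total = sum(counts[c] for c in codons)
        if total == 0:
            continue  # amino acid not observed
        expected = total / len(codons)  # count expected under equal usage
        for c in codons:
            values[c] = counts[c] / expected
    return values
```

A value above 1 means the codon is used more often than expected under equal usage of its synonymous codons, which is the criterion the text applies to the A- and U-ending codons.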
In total, we found 23 intron-containing genes, including 15 protein-coding genes and eight tRNA genes (Table 4). Twenty-one genes (13 protein-coding and eight tRNA genes) contain one intron, and two genes (ycf3 and clpP) contain two introns. The trnK-UUU gene has the largest intron (2505 bp), and the trnL-UAA gene has the smallest intron (483 bp). Studies have shown that ycf3 is required for stable accumulation of photosystem I complexes [30]. Therefore, we speculate that the ycf3 intron gain of Q. acutissima may be helpful for further study of the mechanism of photosynthesis evolution.
Table 4. The lengths of exons and introns in genes with introns in the Q. acutissima chloroplast genome.

Comparative Analysis of Genomic Structure
Chloroplast sequences are often used to measure the genetic diversity within a species, the gene flow between species, and the size of ancestral populations of separated sister species [31]. Thus, it is necessary to understand the chloroplast differences between species. The complete cp genome sequence of Q. acutissima was compared to those of Q. variabilis, Q. dolicholepis, Castanea mollissima, Lithocarpus balansae, and Fagus engleriana. F. engleriana has the smallest cp genome with the largest IR region (51,784 bp), and Q. dolicholepis has the largest cp genome (Table 1). We assumed that the different lengths of the SSC and IR regions are the main reason for the variation in sequence length. To assess genome divergence, sequence identity was calculated for the chloroplast DNA of the six species using the program mVISTA with Q. variabilis as a reference (Figure 2). This comparison revealed that LSC regions are more divergent than SSC and IR regions and that divergence is higher in noncoding than in coding regions. The complete cp genome sequence of F. engleriana is quite different from those of the five other species. There was no significant difference between the chloroplast genome sequences of evergreen and deciduous trees. At the same time, the sliding window analysis indicated that the variation in the cp genome among the six species is located in the LSC and SSC regions (Figure A2). Significant variation was found in the coding regions of some genes, including psbI, rpl33, petB, rpl2, rps16, rpoC2, ndhK, ycf2, ycf1, and ndhI. The highest divergence in noncoding regions was found in the intergenic regions trnK-rps16, rps16-trnQ, psbK-psbI, trnS-trnG, atpH-atpI, atpI-rps2, rpoB-trnC, trnC-petN, psbM-trnD, trnD-trnY, trnE-trnM, trnT-petD, psbZ-trnG, trnT-trnL, trnF-ndhJ, rbcL-accD, psaI-ycf4, ycf4-cemA, petA-psbL, psaJ-rpl33, clpP-psbB, rpl14-rpl16, ndhF-rpl32, ccsA-ndhD, ndhD-psaC, and rps15-ycf1.
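A sliding window analysis of the kind referred to above can be sketched as follows. The text does not state which statistic, window length, or step size was used, so everything here is an assumption: the sketch uses nucleotide diversity (mean pairwise proportion of differing sites), the names `nucleotide_diversity` and `sliding_window_pi` are hypothetical, and the default `window` and `step` values are placeholders.

```python
from itertools import combinations


def nucleotide_diversity(seqs):
    """Mean proportion of differing sites over all pairs of aligned sequences.

    Sites where either sequence has a gap or ambiguous base are skipped.
    """
    pairs = list(combinations(seqs, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    total = 0.0
    for a, b in pairs:
        compared = diffs = 0
        for x, y in zip(a, b):
            if x in "ACGT" and y in "ACGT":
                compared += 1
                diffs += x != y
        total += diffs / compared if compared else 0.0
    return total / len(pairs)


def sliding_window_pi(aligned, window=600, step=200):
    """Return [(window_start, pi), ...] along an alignment.

    `window` and `step` are placeholder defaults, not values from the study.
    """
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("sequences must be aligned to equal length")
    aligned = [s.upper() for s in aligned]
    results = []
    for start in range(0, max(length - window, 0) + 1, step):
        chunk = [s[start:start + window] for s in aligned]
        results.append((start, nucleotide_diversity(chunk)))
    return results
```

Plotting the second element of each tuple against the window start would give a divergence profile along the genome, in which peaks mark candidate hotspot regions such as the intergenic spacers listed above.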
The contraction and expansion of the IR regions at their borders play important roles in evolution. They are common evolutionary events and a major cause of changes in the size of the chloroplast genome. They may also cause variation in the length of angiosperm plastid genomes [32][33][34]. Detailed comparisons of the IR-SSC and IR-LSC boundaries among the cp genomes of the above six Fagaceae species are presented in Figure 3. The IR regions are relatively highly conserved in the genus Quercus: the rpl2 gene in the Quercus cp genomes is shifted by 62 bp from IRb to LSC at the LSC/IRb border, and by 62 bp from IRa to LSC at the IRa/LSC border. Relative to other species in the genus, the position of the IRa/SSC border varies considerably. Between evergreen and deciduous species, we found significant differences at the IRb/SSC border. Some reports showed that ycf1 is necessary for plant viability and encodes Tic214, an important component of the Arabidopsis TIC complex [35,36]. The ycf1 gene crosses the SSC/IRb border, with 1041 bp of ycf1_like within IRb (incompletely duplicated in IRb). The SSC/IRa junction is located within ycf1 in the chloroplast genomes of all six Fagaceae species, and the gene extends into the SSC region by different lengths depending on the genome (Q. acutissima, 4619 bp; Q. variabilis, 4620 bp; Q. dolicholepis, 4611 bp; C. mollissima, 4623 bp; L. balansae, 4626 bp; F. engleriana, 4633 bp); the IRa region includes 1041, 1041, 1068, 1059, 828, and 1049 bp of the ycf1 gene, respectively.

Long-Repeat and SSR Analysis
For the repeat structure analysis (Table 5), 31 forward and 22 inverted repeats were detected in the Q. acutissima cp genome. Most of these repeats are between 19 and 46 bp long. The longest forward repeat is 46 bp in length and is located in the LSC region. A total of 35, 10, and eight repeats were found in the LSC, SSC, and IR regions, respectively. Seven forward repeats are located in the IR regions, including one repeat associated with the ycf1 genes and one repeat related to the trnV-UAC and trnA-UGC genes. Most repeats in the intergenic spacers are distributed in the LSC region. Of the ten repeats in the SSC region, only four are in intergenic spacers.
As chloroplast-specific SSRs are uniparentally inherited and are prone to slipped-strand mispairing, they are often used in population genetics, species identification, and studies of evolutionary processes in wild plants [37,38]. In addition, chloroplast genome sequences are highly conserved, so SSR primers designed for chloroplast genomes can be transferred across species and genera. Yoko et al. used six maternally inherited chloroplast (cpDNA) simple sequence repeat (SSR) markers to study the genetic variation in Q. acutissima [39]. In this study, a total of 65 SSRs were found in Q. acutissima, most of them located in the LSC and SSC regions and a few in the IR regions. These included 61 mononucleotide SSRs (93.85%) and four dinucleotide SSRs (6.15%) (Table 6). Compared with other Quercus species, fewer types of SSRs were identified in Q. acutissima [40]. Among them, two SSRs belonged to the C type, and all the others belonged to the A/T types. These results are consistent with the hypothesis that cpSSRs are generally composed of short polyadenine (polyA) or polythymine (polyT) repeats and rarely contain tandem guanine (G) or cytosine (C) repeats [41]. We also found that 12 SSRs were located in genes, and the remainder were all located in intergenic regions. These cpSSR markers could be used to examine the genetic structure, diversity, differentiation, and maternity in Q. acutissima and related species in future studies.
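The detection of mono- and dinucleotide SSRs can be illustrated with a short regular-expression scan. This is a sketch only and not the tool used in this study: the function name `find_ssrs` is hypothetical, and the minimum repeat numbers (10 for mononucleotide and 5 for dinucleotide motifs) are assumed thresholds that the text does not specify.

```python
import re

# Assumed thresholds: motif length -> minimum number of tandem copies.
MIN_REPEATS = {1: 10, 2: 5}


def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return sorted (start, motif, copies) tuples for perfect SSRs in seq."""
    seq = seq.upper()
    hits = []
    for unit_len, min_rep in sorted(min_repeats.items()):
        # A motif of unit_len bases followed by at least min_rep - 1 copies.
        pattern = re.compile(r"(([ACGT]{%d})\2{%d,})" % (unit_len, min_rep - 1))
        for m in pattern.finditer(seq):
            motif = m.group(2)
            if unit_len > 1 and len(set(motif)) == 1:
                continue  # e.g. "AA" runs are already reported as mononucleotide
            hits.append((m.start(), motif, len(m.group(1)) // unit_len))
    return sorted(hits)


example = "GC" + "A" * 12 + "GC" + "AT" * 6 + "GG"
ssrs = find_ssrs(example)
```

On the toy sequence, `ssrs` contains one A-type mononucleotide repeat and one AT dinucleotide repeat. Counting hits by motif in this way gives the kind of breakdown reported in Table 6, although the exact counts depend on the thresholds chosen.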

Phylogenetic Analysis
Phylogenetic analysis was performed on an alignment of concatenated nucleotide sequences of the complete chloroplast genomes of 25 angiosperm species (Figure 4). We used the Bayesian inference (BI) method implemented in MrBayes to build the phylogenetic tree, with Malus prunifolia and Ulmus gaussenii as the outgroups. Support is high for almost all relationships inferred from the chloroplast genome data (posterior probabilities range from 0.8956 to 1). Notably, the species of the genus Quercus do not form a clade. Several evergreen species group together to form one clade. Q. acutissima and Q. variabilis are sister species and frequently occur together in mixed stands in China; the second clade splits into two subclades. F. engleriana is in the top position, while Q. acutissima appears to be more closely related to Q. variabilis, Q. dolicholepis, and Q. baronii. In general, the topologies of the other branches (genera Fagus, Trigonobalanus, Lithocarpus, and Castanopsis) are almost the same as those based on two nuclear loci (ITS and CRC) [3].

Sampling, DNA Extraction, Sequencing, and Assembly
Q. acutissima trees planted at Nanjing Forestry University (32°04′ N, 118°48′ E) and Zijin Mountain (32°04′ N, 118°50′ E) in Nanjing, China, were sampled. Fresh leaves were collected, packed in ice, and immediately stored at −80 °C until analysis. Genomic DNA was isolated by a modified CTAB method [42]. Agarose gel electrophoresis and a one-drop spectrophotometer (OD-1000, Shanghai Cytoeasy Biotech Co., Ltd., Shanghai, China) were used to assess DNA integrity and quality. Shotgun libraries (250 bp) were constructed from the purified DNA according to the manufacturer's instructions. Sequencing was performed on an Illumina HiSeq 2500 platform (Nanjing, China), yielding at least 9.82 Gb of clean data for Q. acutissima. First, all of the raw reads were trimmed by FastQC. Next, we performed a BLAST analysis of the trimmed reads against the reference genomes (Q. variabilis and Q. dolicholepis) to extract cp-like reads. Finally, we assembled the cp-like reads using NOVOPlasty [43]. NOVOPlasty assembled a subset of reads and extended the sequence as far as possible until a circular genome formed. The assembly was accepted when its length was within the expected range, the overlap was larger than 200 bp, and the sequence formed a circle.

Materials and Methods

Genome Comparison
MUMmer [50] was used for pairwise sequence alignment of the cp genomes. The mVISTA program [51] was applied in Shuffle-LAGAN mode [52] to compare the complete cp genome of Q. acutissima with the published cp genomes of related species, i.e., Q. variabilis (KU240009), Q. dolicholepis (KU240010), C. mollissima (HQ336406), L. balansae (KP299291), and F. engleriana (KX852398), using the annotation of Q. variabilis as a reference.

Phylogenetic Analysis
Phylogenies were constructed by Bayesian inference (BI) analysis using 25 cp genome sequences of Fagaceae species from the NCBI Organelle Genome and Nucleotide Resources database. The sequences were initially aligned using MAFFT [53]. The multiple sequence alignment was then visualized and manually adjusted in BioEdit [54]. IQ-TREE was used to select the best-fitting nucleotide substitution models [55]. TVM + F + R4 and GTR + G were selected as the best substitution models for the BI analyses. BI analyses were conducted using MrBayes [56]. Malus prunifolia (NC_031163) and Ulmus gaussenii (NC_037840) were used as the outgroups.

Conclusions
In this study, we reported and analyzed the complete cp genome of Q. acutissima, an endemic and ecologically important tree species in China. The chloroplast genome was shown to be conserved, with characteristics similar to those of other Quercus species. In the comparison with the cp genomes of five other Fagaceae species, the LSC region was the most divergent of the four regions, and noncoding regions showed higher divergence than coding regions. The phylogenetic analysis of 25 species found Q. acutissima to be closely related to Q. variabilis. The phylogenetic position of the species in the Fagaceae family is consistent with previous studies. This study provides an assembly of the whole chloroplast genome of Q. acutissima, which might facilitate genetics, breeding, and biological discoveries in the future.