Comparative Analysis of the Complete Chloroplast Genomes of Four Chestnut Species ( Castanea )

: Chloroplast (cp) DNA genomes are traditional workhorses for studying the evolution of species and reconstructing phylogenetic relationships in plants. Species of the genus Castanea (chestnuts and chinquapins) are valued as a source of nuts and timber wherever they grow, and chestnut species hybrids are common. We compared the cp genomes of C. mollissima, C. seguinii, C. henryi , and C. pumila . These cp genomes ranged from 160,805 bp to 161,010 bp in length, comprising a pair of inverted repeat (IR) regions (25,685 to 25,701 bp) separated by a large single-copy (LSC) region (90,440 to 90,560 bp) and a small single-copy (SSC) region (18,970 to 19,049 bp). Each cp genome encoded the same 113 genes; 82–83 protein-coding genes, 30 transfer RNA genes, and four ribosomal RNA genes. There were 18 duplicated genes in the IRs. Comparative analysis of cp genomes revealed that rpl22 was absent in all analyzed species, and the gene ycf1 has been pseudo-genized in all Chinese chestnuts except C. pumlia . We analyzed the repeats and nucleotide substitutions in these plastomes and detected several highly variable regions. The phylogenetic analyses based on plastomes conﬁrmed the monophyly of Castanea species. Contributions: Conceptualization, methodology, writing—original


Introduction
Chestnut (Castanea Mill.) is a genus of the Fagaceae that includes twelve to seventeen species distributed in deciduous forests in eastern North America, Europe, and Asia [1]. Chestnuts and the other members of the genus are ecologically and economically important nut and wood producing trees [2]. Four Castanea species are Asian; three are endemic to China: Chinese chestnut (Castanea mollissima Bl.), Seguin chestnut (Castanea Seguinii Dode), and Pearl chestnut or Henry chestnut (Castanea henryi (Skan) Rehd. et Wils.) [3,4]. These species grow across a broad range in China [5], from Jilin province in the North (40 • N) to Hainan Island in the South (18 • N), a range that includes a cold temperate zone, temperate zone, and subtropical zone, and 50 to~2800 m a.s.l. [6]. Chinese chestnut is also widely cultivated around the world as a nut tree [2]. The most important chestnut species in terms of nut commerce are Chinese chestnut, European chestnut (Castanea sativa Mill.), and Japanese chestnut (Castanea crenata Sieb. et Zucc.) [6,7]. C. seguinii and C. henryi (sometimes called willow leaf or pearl chestnut) are cultivated in small quantities and are mainly used for variety improvement in breeding programs [8]. China is the world's largest producer of edible chestnuts, and these chestnuts have been roasted over a hot flame as an attraction and food for tourists and locals. The cultivation of C. mollissima began 2000-3000 years ago [9], probably in China, but the species' origins and genetic diversity are not well-characterized compared to many crops [10][11][12].

Illumina Sequencing, Assembly, and Annotation
Castanea cp genomes were sequenced using the Illumina Hiseq-PE150 sequencing platform (Novogene, Beijing, China). We used the program MITObim v1.7 to perform the reference-guided assemblies [38]. We used the complete chloroplast genome sequence of C. mollissma (GenBank number: HQ336406) as reference [28]. Annotations were performed using the online program Dual Organellar Genome Annotator (DOGMA) [39]. We adjusted the start and stop codons and boundaries between introns and exons using MAFFT v7.0.0 by comparison with homologous genes from other cp genomes [40]. Genes were identified using Geneious v8.0.2, which was also used to identify open reading frames (ORF) that were not previously annotated or identified [41]. The final circular Castanea cp genome map was drawn by Organellar Genome DRAW [42] (Figure 1).  The grey arrows indicate the direction of transcription of the two DNA strands. A GC-content graph is depicted within the inner circle. The circle inside the GC content graph marks the 50% threshold. The maps were created using Organellar Genome DRAW [42].

Repeat Analysis
REPuter software was used to identify and visualize repeat structures [43]. Minimal repeat size was 30 bp, and the default identity of repeat structures was set to 90% similarity. Tandem Repeat Finder was used to locate and display tandem repeats in DNA sequences (>10 bp in length), with the alignment parameters "match, mismatch, and indels" set at 2, 7, and 7, respectively [44]. We used MISA to detect the microsatellites (SSR) in all 8 chloroplast genomes, including mono-(p1, one motif), di-(p2, two motifs), tri-(p3, three motifs), tetra-(p4, four motifs), penta-(p5, five motifs), hexa-(p6, six motifs) and polynucleotide (complex motifs) repeats [45]. MISA reported the type and location of each microsatellite and submitted the DNA sequence to Primer 3, a software that predicted primer sequence, position, melting temperature, and the expected PCR product size.

Hypervariable Hotspot Identification
We compared the four Castanea cp genomes to determine their average pairwise sequence divergence. Genomes were aligned using MAFFT v7, assuming collinearity [40]. Variable and parsimony-informative base sites were identified using MEGA 7.0 software [46]. The R package SPIDER version 1.5.0 was used to compare Castanea cp and to identify hypervariable cp regions using slideAnalyses, with C. mollissima as a reference [47]. We aligned all eight Castanea cp genomes to extract the continuous windows of a chosen size (800 bp and 200 bp) and perform pairwise distance (K2P) analyses of each window. The slideAnalyses function in SPIDER was used to analyze hypervariable regions, using the mean distance between each window across the hypervariable regions and the proportion of zero pairwise distance for each species in the matrix. Selective pressure non-synonymous (K A ) and synonymous (K S ) substitution rates (K A /K S ) were computed with the codeml tool from PAML v4.0, using a YN00 model to test every gene sequence [48].

Analysis of DNA Barcodes
We characterized the hypervariable regions and the universal barcode regions rbcL, matK, and trnH-psbA using tree-building and K-mer statistics within the bbsik function in Barcoding R [49]. We built Maximum Likelihood (ML) trees for each hypervariable region and for region combinations using MEGA 7.0 [46].

Phylogenetic Analyses
Our phylogenetic analysis was based on five data sets: (1) the complete cp DNA sequences, (2) the protein coding sequences, (3) the LSC region, (4) the SSC region, and (5) the IR region. Four species (Quercus spinosa, Q. aliena, Trigonobalanus doichangensis, and Castanopsis echinocarpa) were used as outgroups. Trees were calculated using Maximum Likelihood (ML), Maximum Parsimony (MP), and Bayesian (BI) methods [50][51][52][53]. All of the Castanea cp genome sequences from the finalized dataset were aligned with MAFFT v7.0.0 [40]. The GTRAGMMA model was selected based on output from analysis of all five datasets using MODELTESTv3.7 software [54]. The Maximum Likelihood (ML) phylogenetic tree was produced using the GTR GAMMA model in RAxML v8.0 [55]. For all analyses, 10 independent ML searches were conducted, bootstrap support was estimated with 1000 bootstrap replicates, and bootstrap proportions were drawn on the tree with highest likelihood score from the 10 independent searches. Maximum Parsimony (MP) phylogenetic analyses were performed in PAUP4 using 1000 bootstrap replicates. BI trees were produced using a GTR GAMMA model in MrBayes v3.2.6, set to 1,000,000 generations and stopval = 0.01 with one cold and three incrementally heated Markov Chain Monte Carlo (MCMC) run simultaneously in two parallel runs sampling every 1000 generations [56,57]. The first 25% of the trees were discarded as burn-in. The remaining trees were used to generate the consensus tree. To further explore the phylogeny of the Castanea within the Fagales, we downloaded from NCBI the plastid genome sequence of 35 species in the Fagaceae, Betulaceae, Juglandaceae, and Myricaceae (Table S1), and constructed ML and BI trees along with their support rates.

Characterization and Annotation of the Castanea cp Genomes
We obtained about 24 M raw reads for each genotype; coverage of each genome exceeded 150× (Table 1). After assembly and annotation, the cp genomes of Castanea showed a typical tetrad structure, consisting of a pair of inverted repeats (IRs) from 25,685 bp to 25,701 bp, a long single copy region (LSC) 90,440 bp to 90,560 bp, and a short single copy region (SSC) 18,970 bp to 19,049 bp. The chloroplast genomes of the five Castanea species ranged from 160,805 bp (C. mollissima 2) to 161,010 bp (C. seguinii2) (Figure 2), and the average GC content was 36.76%. All five species' cp genomes contained 130 functional genes, including 37 genes encoding transfer RNA (tRNA) (seven genes in the IR region), and eight genes encoding ribosomal RNA (rRNA) (all located in IR region) ( Table 1). were used to generate the consensus tree. To further explore the phylogeny of the Castanea within the Fagales, we downloaded from NCBI the plastid genome sequence of 35 species in the Fagaceae, Betulaceae, Juglandaceae, and Myricaceae (Table S1), and constructed ML and BI trees along with their support rates.

Characterization and Annotation of the Castanea cp Genomes
We obtained about 24 M raw reads for each genotype; coverage of each genome exceeded 150× (Table 1). After assembly and annotation, the cp genomes of Castanea showed a typical tetrad structure, consisting of a pair of inverted repeats (IRs) from 25,685 bp to 25,701 bp, a long single copy region (LSC) 90,440 bp to 90,560 bp, and a short single copy region (SSC) 18,970 bp to 19,049 bp. The chloroplast genomes of the five Castanea species ranged from 160,805 bp (C. mollissima 2) to 161,010 bp (C. seguinii2) (Figure 2), and the average GC content was 36.76%. All five species' cp genomes contained 130 functional genes, including 37 genes encoding transfer RNA (tRNA) (seven genes in the IR region), and eight genes encoding ribosomal RNA (rRNA) (all located in IR region) ( Table 1). The protein-coding gene rpl22 was absent in all chestnut species in this study. The pseudogene ndhD was also found in C. mollissima 1 and C. mollissima2, and the pseudogene ndhK was found in C. henryi1 (Table 2). The protein-coding gene rpl22 was absent in all chestnut species in this study. The pseudogene ndhD was also found in C. mollissima 1 and C. mollissima2, and the pseudogene ndhK was found in C. henryi1 (Table 2). Table 2. Gene contents in eight Castanea individuals' cp genomes.

Category of
Genes Gene Group Gene Name

Repeat and Simple Sequence Repeats Analyses
Castanea cp genomes contained numerous long repeats including forward repeats, complement repeats, reverse repeats, and palindromic repeats of at least 30 bp with a sequence identity ≥90% (Figure 3, Table S2). We observed 13 long repeats in one sample of C. mollissima and 14 long repeats in another, 16 forward repeats in one sample of C. henryi and 19 forward repeats in another, and 17 long repeats in C. pumila. In C. seguinii, reverse repeats were the most frequent. There was no difference among the species for the number of palindromic and reverse repeats (Figure 3). Each Castanea cp genome contained 7 to 13 tandem repeats ( Figure 3, Table S3).
The cp genome of each Castanea species contained 108-120 SSRs at least 10 bp in length ( Figure 3, Table S4). The number of SSRs of C. mollissima and C. henryi was similar. Most SSRs were located in noncoding parts of the LSC/SSC (about 95% of the total occurrences); SSRs located in coding regions were mainly in psbA, rpoC2, rpoB, ndhK (a pseudogene in C. henryi), atpB, accD, ndhD (a pseudogene in C. mollissima), ndhF, and ycf1 (a pseudogene in all Castanea evaluated).
The number and type of SSRs was similar for all Castanea species. For each species, mono-, di-, tri-, tetra-, penta-, hexa-and complex nucleotide SSRs were all detected ( Figure 3). The mononucleotide, complex nucleotide, and dinucleotide SSRs accounted for about 90% of SSR loci. SSRs in Castanea cp genomes were especially rich in AT; nearly all mononucleotide repeats were A/T. The distribution of SSRs was also similar in all four species, approximately 70% of repeats were found in the intergenic region (IGS), 16% were in introns, and the fewest were found in protein coding genes (CDS). We identified 37 SSR loci that were identically located in all eight genotypes of four species of Castanea. Of these, 17 showed sufficient sequence identity in the flanking region to permit identification of common primer pairs for amplification (Table S5). , and introns of each Castanea species. "p1", "p2", "p3", "p4", "p5", "p6" and "c" refer to mononucleotide, dinucleotide, trinucleotide, tetranucleotide, pentanucleotide, hexanucleotide, and complex nucleotide repeats.
The cp genome of each Castanea species contained 108-120 SSRs at least 10 bp in length ( Figure 3, Table S4). The number of SSRs of C. mollissima and C. henryi was similar. Most SSRs were located in noncoding parts of the LSC/SSC (about 95% of the total occurrences); SSRs located in coding regions were mainly in psbA, rpoC2, rpoB, ndhK (a pseudogene in C. henryi), atpB, accD, ndhD (a pseudogene in C. mollissima), ndhF, and ycf1 (a pseudogene in all Castanea evaluated).
The number and type of SSRs was similar for all Castanea species. For each species, mono-, di-, tri-, tetra-, penta-, hexa-and complex nucleotide SSRs were all detected ( Figure  3). The mononucleotide, complex nucleotide, and dinucleotide SSRs accounted for about 90% of SSR loci. SSRs in Castanea cp genomes were especially rich in AT; nearly all mononucleotide repeats were A/T. The distribution of SSRs was also similar in all four species, approximately 70% of repeats were found in the intergenic region (IGS), 16% were in introns, and the fewest were found in protein coding genes (CDS). We identified 37 SSR loci that were identically located in all eight genotypes of four species of Castanea. Of these, 17 showed sufficient sequence identity in the flanking region to permit identification of common primer pairs for amplification (Table S5).

Selective Pressures in the Evolution of Castanea
We evaluated 77 protein-coding genes in the four Castanea species' cp genomes for their synonymous and non-synonymous rates of all possible pairwise comparisons. We found three genes (matK, ndhD, and rpoC2,) under positive selection (K A /K S ratio >1) (Figure 4a). The K A /K S ratio for matK was 3.59. The K A /K S ratio for ndhD was 1.41 and for rpoC2 it was 1.18 ( Figure 4a). We found that these genes were not under selection pressure in all species. For example, in the process of pairwise alignment, the ndhB gene in C. pumila was under selection based on the K A /K S >1 criterion. The ndhB gene did not show significant selection pressure in Chinese Castanea species' cp genomes, but it has been positively selected in the American chestnut (C. dentata) ( Table S6). The value K A /K S for matK and ndhD indicated no evidence of selection based on pairwise comparison of Castanea species' cp genomes (Table S6). selected in the American chestnut (C. dentata) ( Table S6). The value KA/KS for matK and ndhD indicated no evidence of selection based on pairwise comparison of Castanea species' cp genomes (Table S6).

Phylogenetic Analyses
The phylogenies of Castanea were examined based on five datasets (the whole cp genome, protein-coding genes, the LSC region, the SSC region, and the IR region) from the eight Castanea cp genomes and four published Fagaceae species (used as outgroups) (Figure 6; Figure S1). Evolutionary trees based on the above data were constructed using maximum likelihood (ML), maximum parsimony (MP), and Bayesian (BI) methods. All five datasets and methods produced trees that were highly consistent ( Figure 6; Figure S1). The closer relationships of C. henryi and C. mollissima (versus C. seguinii or C. pumila) was clear.
The tree constructed to represent the whole Fagales was strongly supported ( Figure  6) and similar in topology, irrespective of method (ML or Bayesian). Each family clustered into monophyletic clades. The Fagaceae formed a distinct clade from the Myricaceae, Betulaceae, and Juglandaceae. The Myricaceae and Juglandaceae formed a clade distinct from

Phylogenetic Analyses
The phylogenies of Castanea were examined based on five datasets (the whole cp genome, protein-coding genes, the LSC region, the SSC region, and the IR region) from the eight Castanea cp genomes and four published Fagaceae species (used as outgroups) ( Figure 6; Figure S1). Evolutionary trees based on the above data were constructed using maximum likelihood (ML), maximum parsimony (MP), and Bayesian (BI) methods. All five datasets and methods produced trees that were highly consistent ( Figure 6; Figure S1). The closer relationships of C. henryi and C. mollissima (versus C. seguinii or C. pumila) was clear.
The tree constructed to represent the whole Fagales was strongly supported ( Figure 6) and similar in topology, irrespective of method (ML or Bayesian). Each family clustered into monophyletic clades. The Fagaceae formed a distinct clade from the Myricaceae, Betulaceae, and Juglandaceae. The Myricaceae and Juglandaceae formed a clade distinct from the Betulaceae. All of the analyzed families within Fagales had bootstrap values (BS) = 1 (Figure 6). The relationships among the samples based on the chloroplast datasets was the same as the system of classification proposed by APG IV [58].
the Betulaceae. All of the analyzed families within Fagales had bootstrap values (BS) = 1 ( Figure 6). The relationships among the samples based on the chloroplast datasets was the same as the system of classification proposed by APG IV [58].

Structural and Sequence Comparisons of cp Genomes in Castanea
The IR region of eight Castanea cp genomes was highly conserved, but we identified structure variation in the IR/SC boundary where rps19-rpl2-trnH and ycf1-ndhF were located. Rpl22 is non-functional in Castanea [28], but in some Fagaceae rpl22 is completely absent and in others it is functional. For example, of the four outgroups used in this study, the rpl22 gene was found in Quercus aliena, Quercus spinosa, and Castanopsis echinocarpa, but not in Trigonobalanus doichangensis.
In the C. henryi1 genome, ndhK pseudo-genized containing multiple stop codons in the protein coding region ( Figure S2a). In C. mollissima1 and C. mollissima2, the gene region encoding ndhD was not interrupted with stop codons, but it does not encode a functional protein, so the ndhD gene has evolved into a pseudogene in C. mollissima ( Figure S2b).

Structural and Sequence Comparisons of cp Genomes in Castanea
We sequenced the cp genomes of four genotypes of Chinese Castanea species and compared them to three published reference sequences. Our goal was to begin to characterize genetic variation in cp genomes both within and among Castanea species. Our results are foundational for future studies of Castanea, including the molecular identification, evolution, and breeding of Chinese or World Castanea species. We also identified and

Structural and Sequence Comparisons of cp Genomes in Castanea
The IR region of eight Castanea cp genomes was highly conserved, but we identified structure variation in the IR/SC boundary where rps19-rpl2-trnH and ycf1-ndhF were located. Rpl22 is non-functional in Castanea [28], but in some Fagaceae rpl22 is completely absent and in others it is functional. For example, of the four outgroups used in this study, the rpl22 gene was found in Quercus aliena, Quercus spinosa, and Castanopsis echinocarpa, but not in Trigonobalanus doichangensis.
In the C. henryi1 genome, ndhK pseudo-genized containing multiple stop codons in the protein coding region ( Figure S2a). In C. mollissima1 and C. mollissima2, the gene region encoding ndhD was not interrupted with stop codons, but it does not encode a functional protein, so the ndhD gene has evolved into a pseudogene in C. mollissima ( Figure S2b).

Structural and Sequence Comparisons of cp Genomes in Castanea
We sequenced the cp genomes of four genotypes of Chinese Castanea species and compared them to three published reference sequences. Our goal was to begin to characterize genetic variation in cp genomes both within and among Castanea species. Our results are foundational for future studies of Castanea, including the molecular identification, evolution, and breeding of Chinese or World Castanea species. We also identified and characterized the size and locations of repeat sequences within the cp genomes of four Chinese Castanea species. The structure and location of repeats is associated with genome evolution [26,59] and as such has phylogenetic relevance [25,26]. Extensive information about the nature and location of sequence variants in Chestnut could open the door to ultra-barcoding, a method where detailed sequence data can be used to trace the lineage, source, or adaptive phenotypes of a sample [24].
Most angiosperm chloroplasts contain 74 protein-coding genes; an additional three are found in a few species [25][26][27][28]. The four Castanea cp genomes we sequenced revealed 84 predicted protein-coding genes (77 unigenes were predicted to be protein coding). The protein-coding gene rpl22 was absent in all chestnut species in this study, which is consistent with Jansen et al. (2005) [28].

Phylogenetic Analysis
Our phylogenetic analysis based on whole cp genome sequence data was in concordance with the current phylogeny of Castanea and of the Fagales in general, irrespective of the method (MP, ML, BI) ( Figure 6) [25,26]. The strong genomic similarity of C. henryi and C. mollissima was confirmed. Based on morphology, C. seguinii and C. mollissima appear most closely related [6,7], but the genomic evidence is that the C. seguinii cp reflects an early diverging of Chinese Castanea species (Figure 6). The accepted phylogeny of Chinese Castanea also reflects their current geospatial distribution, and there is no evidence for natural hybrid zones [59].

DNA Barcode Development
DNA barcodes have been widely used in evolutionary and phylogenetical studies for both plants and animals [49]. We compared cp genomes of Castanea using slideAnalyses to calculate genetic distance across a moving 800 bp window [47]. This method identified five regions of the Castanea genome that might be useful for barcoding: trnK-UUU-rps16, psbM-trnD-GUU, rbcL-accD, petA-psbJ, and rpl2 (Figures 4 and 5). We suggest the sequence of rbcL, matK, petA-psbJ as a barcode. The lengths of the three selected sequences are respectively 1428 bp, 1508 bp, and 1072 bp. There are 4, 14, and 75 mutation sites and 4, 10, and 68 informative sites within these loci. The barcode would be useful for distinguishing American chestnut (C. dentata) and Chinese chestnuts, and as such could provide a method to track the source of hybrids used in breeding. Simple sequence repeats (SSRs) were highly variable among and within Castanea species. As such, these have value for population genetics and other applications requiring high levels of polymorphism.

Conclusions
In this study, we sequenced and compared the complete chloroplast genomes of eight genotypes from four Castanea species. The length of the chloroplast genomes ranged from 160,805 bp to 161,010 bp. Comparative analysis revealed that rpl22 was absent in all analyzed species and the gene ndhK has been pseudo-genized in all Chinese chestnuts except C. pumlia. High levels of genetic variation at trnK-UUU-rps16, psbM-trnD-GUU, rbcL-accD, petA-psbJ, and rpl2 make these regions excellent candidates for barcode development within chestnut taxa.
Supplementary Materials: The following are available online at https://www.mdpi.com/article/10 .3390/f12070861/s1. Figure S1. Phylogeny trees of eight Castanea individuals plus four taxa based on (a) protein-coding genes (CDS), (b) inverted repeats (IR) regions, (c) large single-copy (LSC) region, and (d) small single-copy (SSC) regions. Bootstrap values (%) are shown above branches. Figure S2. The genetic variation of the pseudogene ndhK and ndhD in the eight Castanea chloroplast genomes. (a) The pseudogene ndhK in Castanea henryi1 chloroplast genome. (b) Alignment of pseudogene ndhD in the eight Castanea chloroplast genomes. Table S1. List of chloroplast genome sequences included in the phylogenetic analyses. Table S2. Analysis of palindromic repeats and dispersed repeats in eight Castanea chloroplast genomes. Table S3. Analysis of tandem repeats in eight Castanea chloroplast genomes. Table S4. List of simple sequence repeats in eight Castanea chloroplast genomes. Table S5. Primer sequences for SSR loci. Table S6. K A /K S ratio for protein coding sequences for eight Castanea individuals. Table S7. Indel and single nucleotide polymorphisms (SNP) in the eight Castanea chloroplast genomes.