Comparison and Phylogenetic Analyses of Nine Complete Chloroplast Genomes of Zingibereae

: Zingibereae is a large tribe in the family Zingiberaceae, which contains plants with important medicinal, edible, and ornamental values. Although tribes of Zingiberaceae are well circumscribed, the circumscription of many genera within Zingibereae and the relationships among them remain elusive, especially for the genera of Boesenbergia , Curcuma , Kaempferia and Pyrgophyllum . In this study, we investigated the plastome variation in nine species representing ﬁve genera of Zingibereae. All plastomes showed a typical quadripartite structure with lengths ranging from 162,042 bp to 163,539 bp and contained 132–134 genes, consisting of 86–88 coding genes, 38 transfer RNA genes and eight ribosomal RNA genes. Moreover, the characteristics of the long repeats sequences and simple sequence repeats (SSRs) were detected. In addition, we conducted phylogenomic analyses of the Zingibereae and related taxa with plastomes data from additional 32 species from Genbank. Our results conﬁrmed that Stahlianthus is closely related to Curcuma , supporting the idea of merging it into Curcuma . Kaempferia , Boesenbergia and Zingiber were conﬁrmed as close relatives and grouped together as the Kaempferia group. Pyrgophyllum is not allied with the Curcuma clade but instead is embedded within the Hedychium clade. Our results demonstrate the power of plastid phylogenomics in improving the phylogenetic relationships within Zingibereae and provide a new insight into plastome evolution in Zingibereceae.


Introduction
Zingiberaceae, commonly known as the ginger family, is the largest family of the order Zingiberales. It comprises over 50 genera and consists of more than 1300 species [1], widely distributed throughout tropical Africa, Asia, and the Americas, with species abundant in South and Southeast Asia [2]. Many species of the ginger family are important ornamental, spice, or medicinal plants [3][4][5]. The first comprehensive phylogenetic analysis based on nuclear ITS region and plastid matK region confirmed the long-suspected complexity of generic concepts in Zingiberaceae and divided the Zingiberaceae family into six tribes and four subfamilies: Zingiberoideae (Zingibereae and Globbeae), Tamijioideae (Tamijieae), Siphonochiloideae (Siphonochileae) and Alpinioideae (Alpinieae and Riedelieae).
The tribe Zingibereae is a large subclade within the family Zingiberaceae and includes ca. 670 species in some 25 genera (Plants of the World Online: http://plantsoftheworldonline. org, IPNI: https://www.ipni.org, accessed on 21 January 2021) [1,6]. Members of Zingibereae are mainly distributed throughout tropical and warm-temperate Asia, with a few species extending to Pacific islands and Australia [1] (Plants of the World Online: http://plantsoftheworldonline.org, accessed on 15 January 2021). Members of Zingibereae are easily distinguished from other gingers by the plane of leaf distichy parallel to the direction of rhizome growth, large and petaloid lateral staminodes, trilocular ovary with axial, basal or free columnar placentation, and labellum usually not connate to the filament [1]. The fresh leaves were obtained from the nursery of the South China Botanical Garden in Guangzhou, China. The total genomic DNA was extracted by a modified CTAB protocol [30]. The libraries were sequenced on Illumina HiSeq Xten platform (Illumina, Inc., San Diego, CA, USA) at Sangon Biotech Co. Ltd. (Shanghai, China).

The Genomes of Plastome Assembly, Annotation and Structure
The raw reads of nine Zingibereae species were trimmed and filtered by NGSQC Toolkit version 2.3.3 [31]. The reads were de novo assembled using SPAdes v3.6.0 (54) and finished using PRICE (Paired-Read Iterative Contig Extension) [32]. The BWA was used to check the de novo assembly in default parameter and the reads were aligned against the assembled genome [33]. The automatic annotator software Unix Program Plann was used to annotate the genome [34]. The annotated genome was matched with open reading frames (ORFs), then the remaining lacking protein evidence ORFs were disregarded [35]. The genes were considered potential pseudogenes which contained one or more frame shift mutations or premature stop codons. In addition, the DRAW tool was used to generate and edited the circular map of the chloroplast genomes [36].

The Analysis of Codon Usage
The relative synonymous codon usage (RSCU) is used to represent the ratio of the specific and the expected codon frequency. RSCU > 1.00 indicates that a codon is used more frequently than expected, and vice versa. DAMBE5 is used to calculate the RSCU [37].

Complete Chloroplast Genome Comparison and Molecular Marker Identification
We used the mVISTA with the annotated sequence of Curcuma kwangsiensis as a reference to compare similarities and detect any rearrangement or inversion among nine newly sequenced Zingibereae species which make pairwise alignments in the LAGAN model [38]. The rates of nonsynonymous (Ka) and synonymous substitutions (Ks) were calculated in DnaSP 6.0 based on 80 protein coding regions [39]. In DnaSP 6.0, the sequence polymorphism and nucleotide diversity (Pi) values were evaluated.

The Analysis of Long Repetitive Sequences and Simple Sequence Repeats (SSRs)
The long repeats (forward, reverse, palindromic and complementary) among the complete chloroplast genome of nine newly sequenced Zingibereae species based on the size and location of the long repeats in REPuter were calculated [40]. The detection parameter settings were a minimum repeat size of 30 bp, and the Hamming distance of 3. MISA software (http://pgrc.ipk-gatersleben.de/misa/, accessed on 24 January 2021) was used to detect SSRs. The parameters were set as follows: ≥ten for mono-; ≥five for di-; ≥four for tri-, ≥three for tetra-, ≥three for penta-and ≥three for hexa-. The interruptions (max difference between 2 SSRs) less than 9 bp were termed "complex".

Phylogenetic Analysis
In this study, 30 accessions of eight genera (one Boesenbergia species, one Cautleya (Royle ex Benth.) Hook.f. species, 14 Curcuma species, two Hedychium J.Koenig. species, three Kaempferia species, one Pyrgophyllum species, three Roscoea Sm. species, two Stahlianthus species, three Zingiber Mill. species) belonging to Zingibereae were analyzed. Nine outgroup species included four Alpinia Roxb. species, two Amomum Roxb. species, one Lanxangia M.F.Newman & Škorničk. species and two Wurfbainia Giseke species. Another two species from the closely related family (Costus viridis S.Q. Tong and Musella lasiocarpa (Franch.) C.Y. Wu) were used to root the trees. Except for nine newly sequenced species, the remaining 32 published chloroplast genomes were downloaded from NCBI. A list of GenBank accessions is provided in Supplementary Table S1.
In order to make a more reasonable utilization of the relationships based on phylogenetic trees, we used a complete chloroplast genome, CDS, LSC and intron sequences for phylogenetic analysis. The software MAFFT version 7.0 was used to align the multiple sequences before inferring the phylogenetic trees [41]. Maximum likelihood (ML) methods in the program PAUP * Version 4.0 were used to construct the phylogenetic trees [42].

The Genome Structure and Content of Nine Zingibereae Species
Chloroplast genomes of nine Zingibereae species (six newly reported) were sequenced and assembled with lengths ranging from 162,042 bp (Pyrgophyllum yunnanense) to 163,539 bp (Curcuma aff. singularis) ( Table 2). All cp genomes had a typical quadripartite circular structure with a pair of IR regions that separated the LSC and SSC regions, and the gene map of the B. kingii chloroplast genomes was presented in Figure 1 as representative. The LSC region ranged from 86,943 bp (C. aff. plicata) to 88,251 bp (C. aff. singularis), accounting for 33.78%-34.11% of the total length. The SSC region ranged from 15,568 bp (Stahlianthus involucratus to 16,023 bp (P. yunnanense), accounting for 29.14%-29.66% of the total length. The IR regions ranged from 29,379 bp (P. yunnanense) to 30,117 bp (S. involucratus), accounting for 40.89%-41.30% of the total length.

Condon Usage Bias
A total of 67 coding genes were used to estimate the codon usage frequency based on the relative synonymous codon usage (RSCU) value (Table S4). All genes were encoded by 27,705 (P. yunnanense) to 27,904 (S. involucratus) codons. UAA, UGA and UAG were considered to be the termination codons. For nine Zingibereae species, the serine encoded by AGC had the lowest RSCU value (0.31), while methionine encoded by AUG had the highest one (2.65). The AUU, AAA and GAA encoded isoleucine, lysine and glutamic acid, respectively, had higher frequencies of occurrence than others (more than 1100). In addition, A or T had a higher nucleotide frequency than G or C in the third codon position, which was relatively common in most angiosperm, and the richness of A or T in the IR regions was the principal reason [43] (Figure 2 and Table S4).

Comparative Genomic Analysis
A high degree of synteny and gene order conservation indicated a high evolutionary conservatism at plastome level ( Figure 3). It is clear that the Curcuma species can be separated from the other Zingibereae species by many genes, such as atpF, rpl16 and atpH-atpI. However, the divergence among five Curcuma species was very low. Notably, the regions of LSC and SSC had more variation than the regions of IRs, and the non-coding regions had a greater differentiation than that of coding regions. Some regions had more variation, such as matK, rps16, atpF, ndhH, clpP among the coding regions, ycf1 intron, and atpH-atpI, petN-psbM, trnA-psbD and rpl32-trnL in the intergenic regions.

Expansion and Contraction of Inverted Repeats (IRs)
Among nine Zingibereae species, the sizes of the IR regions varied from 29,379 bp (P. yunnanense) to 30,117 bp (S. involucratus). The rpl22 genes were located on the boundaries of LSC regions, and the distance between rpl22 and the boundary of LSC/IRb ranged from 23 bp (C. aff. singularis) to 538 bp (P. yunnanense). The rps19 coding gene was located in the IRb region but that gene of P. yunnanense was located in the LSC region. The distance between rps19 and the boundary of LSC-IRb ranged from 3 bp (P. yunnanense) to 167 bp (S. involucratus). The IRb/SSC and SSC/IRa boundary was crossed by the ycf1 gene, which was a critical gene. In the IRb/SSC boundary, the ycf1 gene located in the SSC region was from 4 bp (S. involucratus) to 43 bp (C. ruiliensis). At the SSC/IRa boundary, the ycf1 gene located in the SSC region was from 1210 bp (S. involucratus) to 1592 bp (P. yunnanense). For S. involucratus, the ndhF gene spanned the IRb region and the SSC region. However, in the other eight Zingibereae species, the ndhF gene was located in the SSC region. At the SSC/IRa boundary, the rps15 gene located in the SSC region was from 1657 bp (S. involucratus) to 2037 bp (P. yunnanense). The psbA gene was located on the right side of IRa/LSC regions with the distance of 103 bp (C. aff. singularis)-260 bp (K. rotunda L.) ( Figure 5).

Repeat Structure and SSR Analysis
There were a total of 836 long repeats among nine Zingibereae species, including forward repeats, palindromic repeats, reverse repeats and complement repeats ( Figure 6 and Table S6). Curcuma aff. plicata had the largest number of repeats, including 39 forward, 45 palindromic, 40 reverse and 20 complement repeats, while Boesenbergia kingii had the least number of repeats, including 23 forward, 28 palindromic, ten reverse and three complement repeats. Curcuma aff. singularis had the least number of complement repeats (having only one). In all, the repeats mostly ranged from 30 to 137 bp. The majority of these repeats showed lengths of 30, 31 and 33 bp.  Simple sequence repeat (SSR), also known as tandem repeats or microsatellites, consists of DNA repeat with sizes of 1-6 bp and can be used as important molecular markers for species identifications [44][45][46]. There were seven kinds of SSRs in nine Zingibereae species: mononucleotide, dinucleotide, trinucleotide, tetranucleotide, pentanucleotide, hexanucleotide and complex. There were 95-118 SSRs in each species (Figure 7a). Among each species, mononucleotide repeats were the most common one, with numbers ranging from 45-57; followed by dinucleotide ranging from 25-36; tetranucleotide SSRs ranging from 13-23; trinucleotide SSRs ranging from 3-10; complex SSRs ranging from 3-10; pentanucleotide SSRs ranging from 1-2; hexanucleotide SSRs ranging from 0-1 (only found in P. yunnanense). Among nine Zingibereae species, the results showed that the mononucleotide A/T repeats accounted for 43.95% and 54.44%, respectively (Table S7). The mononucleotide C/G repeats accounted for 1.06% and 0.55%, respectively. The number of mononucleotide A/T repeats in C. aff. singularis was 23/34, which was the highest, while Kaempferia rotunda had the lowest (17/26).

Sequence Divergence Hotspots
In the chloroplast genomes, the divergence hotspots can provide useful information and are often applied to assess geographic distribution and phylogeny [28,[47][48][49]. Our results indicated that the Pi values in the coding regions were lower than those in the intergenic regions (Table S8). For the coding regions, the values of the LSC regions ranged from 0.0010-0.0933, followed by the values of the SSC regions ranging from 0.000-0.0753 and the values of the IR regions ranging from 0.0000-0.0175. Three high divergence hotspots, viz. rps2, rpoC2 and rps15, were selected as potential molecular markers to identify related species (Figure 8).

Phylogenetic Analysis Based on Chloroplast Genomes
In this study, we utilized 41 complete cp genomes, including nine newly sequenced genomes and 20 previously reported chloroplast genomes of Zingibereae species, nine other Zingiberaceae species and two species from closely related families to infer phylogenetic relationships. The phylogenetic trees were constructed based on a complete cp genome, the coding regions (CDS), LSC region and intron data. The phylogenetic trees using four different datasets had different topologies, but all recognized the monophyly of the Zingibereae (Figure 9, Figures S1-S3). ML phylogeny inferred from CDS was the best resolved, thus is displayed here and discussed below (Figure 9). Two clades are recognized, namely the Curcuma clade and the Hedychium clade (Figure 9). Only the Curcuma clade is strongly supported (Figure 9). Resolution within the Curcuma clade is rather high. Three groups, 'Ecomatae', 'Hitcheniopsis' and 'Curcuma', are strongly supported, although the Ecomatae group was only represented by one species here, namely Curcuma aff. singularis. The Hitcheniopsis group, represented by C. alismatifolia and Stahlianthus involucratus, was resolved in a sister position to the Ecomatae group. The Curcuma group was represented by the remaining species. Relationships within the Curcuma group were not satisfactorily resolved.
Within the Hedychium clade, Hedychium, Kaempferia, Roscoea and Zingiber are supported to be monophyletic with a high support value (bootstrap value = 100%) ( Figure 9). Kaempferia, Zingiber and Boesenbergia formed a group separate from the rest with strong support (bootstrap value = 100%), while the relationships among the remaining genera are unresolved. Pyrgophyllum is nested within the Hedychium clade. Similar results were also found in Figures S1-S3.

Discussion
All nine complete cp genomes of Zingibereae species had a typical four-segment structure, including 84-88 coding genes, 38 tRNAs and 8 rRNAs. The genome size of nine Zingibereae species ranged from 162,042 to 163,539 bp with GC content ranging from 36.00% to 36.25%. The size range of these sequenced cp genomes are similar to the sizes of the earlier reported cp genomes of Zingibereae species [24,[50][51][52]. The IRs of earlier reported species were found to be different in length between 28,950 bp and 30,150 bp but the IRs of the nine species reported here varied from 29,379 bp to 30,117 bp. Thus, the expansion and contraction of the size in IR region was the main reason for the genome sizes variation among the Zingibereae species. Additionally, IR expansion or contraction is generally accompanied with the change of gene location. For example, the rps19 gene as a pseudogene frequently spanned the LSC-IR and SSC-IR boundaries in some angiosperms [29,53]. However, in Zingibereae species, rps19 coding gene was located in the IRb regions, while it was located in the LSC region in P. yunnanense. In nine Zingibereae species, the rps19 gene was fully duplicated in accordance with the results reported in other Zingiberaceae species [24,51,52,54]. Pseudogene ψycf1 was also related to the contraction and expansion of the IR region. ψycf1 was present in Zingibereae species, which was truncated at the IR/SSC boundary. In previous studies, ycf1 has been used in the phylogeny of some taxa [55,56], while our results showed ψycf1 had no phylogenetic significance in Zingibereae species. Differences in the location of genes between species provide useful information on evolutionary relationships in genetic research. In this study, it was clear that the organization, genome size and structure of the nine chloroplast genomes were highly conserved. The largest variation of Zingibereae cp genomes was the intergenic areas, which was similar to other chloroplast genomes [19].
Meanwhile, the low ratios of Ka/Ks and evolutionary rate were assessed among nine Zingibereae species. Most of the genes (Ka/Ks = 0) with the lowest evolutionary rates were photosynthetic genes, e.g., ndhC, ndhJ, ndhK, petG, petL, psaC, psaI, psbE and psbF. The ycf1, ycf4 and ccsA genes evolved more quickly and had higher Ka/Ks (≥1). The evolutionary rate of clpP was species-specific [57], while the clpP gene among nine Zingibereae species experienced negative selection and the ratio of Ka/Ks was 0.3326, which was far less than that of many taxa [58][59][60]. One previous study showed that the gene had gone through spells of relatively accelerated sequence evolution, and thus led to the intron loss in some plants [57]. In this study, the clpP gene contained two introns in nine Zingibereae species, which might be the reason for the low ratio of Ka/Ks. Zingibereae species mostly grow in disturbed habitats, and the environmental conditions of their habitats vary from tropical rainforest (wet-hot) to Qinghai-Tibet Plateau (cold-drought). This promotes gene exchange among colonies of the population in inferior and unfavorable habitats. Genes under positive selection often bring on many repeating amino acid sequence insertions in varying degrees and that may be involve in a recent increase in diversification rate after adapting to a new ecological environment [61]. To understand the ratios of Ka/Ks and the evolutionary rate of genes would provide us valuable information on how Zingibereae species adapt to their environment.
The SSRs are typically mononucleotide tandem repeat DNA sequences that are widely used for species identification and genetic diversity research [62,63]. The SSRs mainly consist of short polyadenine or polythymine repeats and ranged from 95 to 118 among nine Zingibereae species, which were in agreement with previous studies [24,51,52,54]. Due to a lack of genome resources in Zingibereae, the SSRs can be used for species identification and genetic diversity research on Zingibereae species and their relatives.
Chloroplast genome sequences have been valuable in molecular, evolutionary, and phylogenetic studies. Numerous analyses on the basis of cp genome sequence comparison have resolved various phylogenetic problems and improved our understanding of complex evolutionary associations among angiosperms [27,64,65]. Our phylogenomic analyses based on cp genome sequences also revealed that the phylogenetic resolution within Zingibereae has been greatly improved (with high support and the similar topology among different analyses) in comparison to the most comprehensive previous phylogenetic studies of the Zingibereae based on the nuclear ribosomal ITS region and the plastid matK and trnL-F regions [1,6]. Our results strongly supported that Zingibereae was separated from Alpinieae, which agreed with the past study [1]. Based on matK and ITS combined, a Kaempferia clade, including Boesenbergia, Kaempferia, Zingiber is weakly supported [1], but we obtain strong support from cp genome sequences, and which is similar to the conclusion made in other studies by DNA barcodes [66,67]. Based on the combination of trnL-F region and ITS, the tribe is divided into two major clades, the Curcuma clade and the Hedychium clade. Nonetheless, these two studies showed that the relationships within these clades remained uncertain because statistical support was weak. Our phylogenetic trees demonstrated that these two major clades were identified in the Zingibereae; namely, the Curcuma clade in the sense of Kress et al. (2002) [1] with strong support and the Hedychium clade in the sense of Ngamriabsakul et al. (2004) [6] with weak support.
Within the Curcuma clade, Stahlianthus is closely related to Curcuma at the molecular level, supporting the idea of merging it into Curcuma [9]. Our results confirmed the monophyly of Curcuma and the infrageneric classification proposed by Záveská et al. (2012) in which C. subg. Curcuma and C. subg. Hitcheniopsis (Baker) K.Schum. were retained and a new subgenus, C. subg. Ecomatae Škorničk. & Šída f. was proposed [8]. The representatives of the Hitcheniopsis group resolved here correspond to Curcuma subg. Hitcheniopsis [11]. The Curcuma group includes species traditionally classified in subgenus Curcuma. In accordance with previous studies [1,8], Curcuma subg. Ecomatae represented by C. aff. singularis here is more closely related to C. subg. Hitcheniopsis than C. subg. Curcuma based on the cpDNA data. However, relationships of species within these clades are complex because polyploidization and hybridization were important for the speciation of Curcuma species. More detailed analyses of species relationships within Curcuma will be the subject of further studies.
Within the Hedychium clade, a Kaempferia group in the sense of Kress et al. (2002) [1] consisting of Boesenbergia, Kaempferia and Zingiber was also identified with strong support (bootstrap value = 100%), whereas the relationships of the remaining members were unresolved. According to the complete cp genome, the coding regions (CDS), LSC region and intron data, Kaempferia is supported to be monophyletic and is sister to Zingiber. Since only one species, B. kingii, belonging to Boesenbergis was sampled, the relationship within Boesenbergia was unable to be further investigated.
In the previous phylogeny study [15], Pyrgophyllum yunnanense was very closely related to the genus Curcuma. However, P. yunnanense is not allied with the Curcuma clade but instead is embedded within the Hedychium clade. Despite these findings, the systematic relationships of P. yunnanense remain uncertain. The natural hybridization and polyploidization were the main cause of inconsistency in Zingibereae. Considering the Zingibereae hybrid origin, the features of maternal inheritance in the chloroplast genome could provide more evidence to clarify their phylogenetic relationships. Further sampling of Zingibereae species may prove their relationships.

Conclusions
In this study, complete chloroplast genomes of nine Zingibereae species including Boesenbergia kingii, Curcuma aff. plicata, C. aff. singularis, C. ruiliensis, Kaempferia rotunda, and Pyrgophyllum yunnanense were firstly published. The chloroplast genomes of nine Zingibereae species were similar in structure, composition and gene order, showing that the chloroplast genomes studied here are highly conserved. Moreover, we also identified the SSR sites and three divergence hotspots (rps2, rpoC2 and rps15), which could provide powerful markers for phylogenetic and identification analyses within Zingibereae.
Our results shed a new light on the phylogenetic relationships within Zingibereae and demonstrated the continuing power of plastome sequencing to improve phylogenetic resolution among the complicated taxa of Zingiberaceae. The phylogenomic analysis strongly supported the idea that Zingibereae is monophyletic and can be divided into two clades, namely the Curcuma clade and the Hedychium clade. The monophyly of the genus Curcuma and three subgenera in Curcuma are confirmed with high support. Our results also showed that Hedychium, Kaempferia, Roscoea and Zingiber are strongly supported to be monophyletic. Pyrgophyllum yunnanense is not allied with the Curcuma clade but instead is embedded within the Hedychium clade. However, the systematic relationships of Pyrgophyllum and Boesenbergia remain unresolved. Further work based on broader sampling within Zingibereae is needed.
Supplementary Materials: The following are available online at https://www.mdpi.com/article/ 10.3390/f12060710/s1, Figure S1: Phylogenetic tree constructed using the complete chloroplast genome data. Figure S2: Phylogenetic tree constructed using intron data. Figure S3: Phylogenetic tree constructed using LSC data. Table S1. The GenBank accession numbers of 32 species using in phylogenetic analysis. Table S2. Genes contained in nine sequenced Zingibereae chloroplast genome. Table S3. The genes with introns in the nine Zingibereae chloroplast genomes. Table S4. Codon usage and codon-anticodon recognition pattern of nine Zingiberaeace species.