Complete Chloroplast Genome Sequences of Kaempferia galanga and Kaempferia elegans: Molecular Structures and Comparative Analysis

Kaempferia galanga and Kaempferia elegans, which belong to the genus Kaempferia of the family Zingiberaceae, are used as a valuable herbal medicine and an ornamental plant, respectively. Chloroplast genomes have been used for molecular marker development, species identification and phylogenetic studies. In this study, the complete chloroplast genome sequences of K. galanga and K. elegans are reported. The complete chloroplast genome of K. galanga is 163,811 bp long and has a quadripartite structure, with a large single copy (LSC) region of 88,405 bp and a small single copy (SSC) region of 15,812 bp separated by a pair of inverted repeats (IRs) of 29,797 bp each. Similarly, the complete chloroplast genome of K. elegans is 163,555 bp long and has a quadripartite structure in which IRs of 29,773 bp each separate an LSC of 88,020 bp from an SSC of 15,989 bp. The 111 genes in K. galanga and 113 genes in K. elegans comprised 79 protein-coding genes and 4 ribosomal RNA (rRNA) genes in each species, as well as 28 and 30 transfer RNA (tRNA) genes in K. galanga and K. elegans, respectively. The gene order, GC content and orientation of the two Kaempferia chloroplast genomes exhibited high similarity. The location and distribution of simple sequence repeats (SSRs) and long repeat sequences were determined. Eight highly variable regions between the two Kaempferia species were identified, and 643 mutation events, including 536 single-nucleotide polymorphisms (SNPs) and 107 insertions/deletions (indels), were accurately located. Sequence divergences of the whole chloroplast genomes were calculated among related Zingiberaceae species. The phylogenetic analysis based on SNPs among eleven species strongly supported that K. galanga and K. elegans formed a cluster within Zingiberaceae. This study identified the unique characteristics of the entire K. galanga and K. elegans chloroplast genomes, which contribute to our understanding of chloroplast DNA evolution within Zingiberaceae species. It provides valuable information for phylogenetic analysis and species identification within the genus Kaempferia.


Introduction
The genus Kaempferia, which belongs to the family Zingiberaceae, consists of approximately 50 species worldwide [1][2][3]. Kaempferia species are distributed in tropical Asia [1,2] and are grown primarily for their ornamental foliage rather than for their flowers [3]. In addition, several species have long been cultivated for their medicinal properties [3]. Within this genus, Kaempferia galanga is a valuable medicinal herb and Kaempferia elegans is a valuable ornamental plant. K. galanga is mainly distributed in the regions of Southern and Northwestern China (Guangdong,
There were 111 predicted genes in the K. galanga chloroplast genome, including 79 protein-coding genes, 28 tRNA genes and 4 rRNA genes, while the 113 genes predicted in the K. elegans chloroplast genome consisted of 79 protein-coding genes, 30 tRNA genes and 4 rRNA genes (Table 2). Among the protein-coding genes in the K. galanga chloroplast genome, 61 genes were located in the LSC region, 12 genes were in the SSC region and 8 genes were duplicated in the IR regions (Supplementary file 1). In total, there were 18 intron-containing genes in the K. galanga chloroplast genome, 16 of which contained one intron and two of which (ycf3 and clpP) contained two introns (Table 3). Among the protein-coding genes in the K. elegans chloroplast genome, 63 genes were located in the LSC region, 12 genes were in the SSC region and 6 genes were duplicated in the IR regions (Supplementary file 2). In total, there were 17 intron-containing genes in the K. elegans chloroplast genome, 15 of which contained one intron and two of which (ycf3 and clpP) contained two introns (Table 3).
Table 2. Genes present in the chloroplast genomes of K. galanga and K. elegans.

Analysis of SSRs and Long Repeats
SSRs, or microsatellites, are tandem repeat sequences consisting of 1-6 nucleotide repeat units and are widely distributed in chloroplast genomes [5,7,12]. SSRs were detected using MISA in the chloroplast genomes of both Kaempferia species. We detected 240 and 248 SSRs in the K. galanga and K. elegans chloroplast genomes, respectively. Mononucleotide motifs were the most abundant type of repeat and dinucleotides were the second most abundant in both chloroplast genomes (Figure 3 and Table S2). There were 177 mono-, 32 di-, 6 tri-, 21 tetra-, 3 penta- and 1 hexanucleotide SSRs in the K. galanga chloroplast genome; by contrast, there were 188 mono-, 33 di-, 7 tri-, 19 tetra- and 1 pentanucleotide SSRs in the K. elegans chloroplast genome (Figure 3). The majority of SSRs were located in the LSC regions rather than in the IR and SSC regions of both chloroplast genomes (Figure 3 and Table S2). SSRs were more abundant in non-coding regions than in coding regions of both genomes (Figure 3). Furthermore, almost all SSR loci were composed of A or T, which contributed to the bias in base composition (A/T; both 63.9%) in the chloroplast genomes of the two Kaempferia species.

Long repeat sequences in the K. galanga and K. elegans chloroplast genomes were analyzed using REPuter, and the results are shown in Figure 4 and Table S3. In the K. galanga chloroplast genome, 21 forward repeats, 20 palindrome repeats, 5 reverse repeats and 4 complement repeats were detected (identity > 90%) (Figure 4 and Table S3). In comparison, in the K. elegans chloroplast genome, 26 forward repeats, 17 palindrome repeats, 4 reverse repeats and 2 complement repeats were detected (Figure 4 and Table S3). Of the 50 repeats in the K. galanga chloroplast genome, 38 repeats (76.0%) were 30-39 bp long, 9 repeats (18.0%) were 40-49 bp long and 3 repeats (6.0%) were ≥50 bp long (Figure 4 and Table S3). By contrast, of the 49 repeats in the K. elegans chloroplast genome, 37 repeats (75.5%) were 30-39 bp long, 6 repeats (12.2%) were 40-49 bp long and 6 repeats (12.2%) were ≥50 bp long (Figure 4 and Table S3). In both Kaempferia species, most repeats were of the forward or palindromic type, with lengths mainly in the range of 30-50 bp.

IR Contraction and Expansion
A detailed comparison of the four junctions (LSC/IRa, LSC/IRb, SSC/IRa and SSC/IRb) between the two IRs (IRa and IRb) and the two single-copy regions (LSC and SSC) was performed among A. zerumbet, C. flaviflora and Z. spectabile in comparison with K. galanga and K. elegans (Figure 5). Although the IR regions of the five Zingiberaceae chloroplast genomes were highly conserved, structural variation was still found at the IR/SC boundaries. As shown in Figure 5, the rpl22-rps19 genes were located at the LSC/IRb junction in K. galanga, K. elegans, A. zerumbet and C. flaviflora, whereas in Z. spectabile this junction was flanked by trnM-ycf2 and lacked the rpl22-rps19 genes. The ycf1-ndhF genes were located at the IRb/SSC junction in all five chloroplast genomes. The ndhF gene was 23, 98, 251, 133 and 33 bp from the IRb/SSC border in K. galanga, K. elegans, A. zerumbet, C. flaviflora and Z. spectabile, respectively (Figure 5). The SSC/IRa junction in all five chloroplast genomes was crossed by the ycf1 gene, with 665-3888 bp of the gene lying within the IRa region. Like the IRb/SSC boundary regions, the IRa/LSC regions were also variable. The rps19-psbA genes were located at the IRa/LSC junction in K. galanga, K. elegans, A. zerumbet and C. flaviflora, whereas in Z. spectabile this junction was flanked by trnH-psbA and lacked the rps19 gene. In K. elegans, the rps19 and psbA genes flanking the IRa/LSC junction were separated from the end of the IRa region by 136 and 123 bp, respectively. However, in Z. spectabile, the trnH gene was the last gene at one end of the IRa region, 256 bp away from the IRa/LSC border. Overall, contraction and expansion of the IR regions were detected across the five Zingiberaceae chloroplast genomes.

Comparative Chloroplast Genomic Analysis
To characterize genome divergence, we performed multiple sequence alignments of the five Zingiberaceae chloroplast genomes using the program mVISTA, with K. galanga as the reference (Figure 6). The comparison demonstrated that the two IR regions were less divergent than the LSC and SSC regions. Moreover, the coding regions were more conserved than the noncoding regions. The most highly divergent regions among the five chloroplast genomes were found in the intergenic spacers, including trnH-psbA, rps16-psbK, atpH-atpI, petN-psbM, trnE-psbD, psbC-rps14, rps4-ycf3, rps4-ndhJ, ndhC-atpE, ycf4-cemA and petA-psbJ in the LSC as well as rpl32-ccsA, psaC-ndhG and ndhG-ndhI in the SSC. Among the coding regions, higher divergence was found in the matK, rpoA, rps16, rps19, ndhF, ccsA, psaC, ndhD, ndhE, ndhG, ndhI, ndhA and ycf1 sequences.
Figure 6. Comparison of five chloroplast genomes using the mVISTA alignment program, with K. galanga as the reference. Gray arrows and thick black lines above the alignment indicate gene orientation. Purple bars represent exons, sky-blue bars represent transfer RNA (tRNA) and ribosomal RNA (rRNA) genes, red bars represent conserved non-coding sequences (CNS) and white peaks represent differences between the chloroplast genomes. The y-axis represents the percentage identity, ranging from 50 to 100%.
Furthermore, sliding window analysis using DnaSP detected highly variable regions between the K. galanga and K. elegans chloroplast genomes (Figure 7A). The average value of nucleotide diversity (Pi) was 0.01075. The IR regions showed lower variability than the LSC and SSC regions. Seven mutational hotspots with remarkably high Pi values (>0.03) were located in the LSC and SSC regions: trnS-trnG, rps12-clpP, psbT-psbN, ycf1-ndhF, ndhF-rpl32, psaC-ndhE and ccsA-ndhD (Figure 7A). By contrast, only one mutational hotspot with a remarkably high Pi value (>0.03), rpl2-trnH, was located in the IR regions (Figure 7A).
Figure 7B shows that the average value of Pi was 0.01591 among the two Kaempferia species, A. zerumbet, C. flaviflora and Z. spectabile. The Pi values among these five species were generally higher than those between the two Kaempferia species. In particular, seven highly divergent loci showed remarkably high Pi values (>0.045), including the trnS-trnG, psbT-psbN, trnH-rpl2, trnI-ycf2, ccsA-ndhD, psaC-ndhE and ycf2-trnI regions (Figure 7B). These regions may be undergoing rapid nucleotide substitution at the species level, indicating their potential as molecular markers for plant identification and phylogenetic analysis.
The chloroplast genomes of K. galanga and K. elegans differed in length by 256 bp (Table 1). In addition to the total length difference, we assessed SNP and indel variation across the entire chloroplast genomes of the two Kaempferia species. A total of 536 SNPs were identified between the two chloroplast genomes (Table S4). Most of these mutations (357 SNPs) were located in intergenic regions. The coding regions contained 91 synonymous SNPs, 87 nonsynonymous SNPs and 1 stop mutation. There were 107 indels identified between the K. galanga and K. elegans chloroplast genomes (Table S5), including 47 deletions and 60 insertions. Of these 107 indels, the longest (10 bp) were located within two intergenic sequences (petA-psbJ and atpH-atpI) and two coding sequences (atpF and rps12).

Phylogenetic Analysis
In this study, phylogenetic trees were constructed with SNPs from eleven species using both the ML and MP methods. The eleven species comprised nine Zingiberaceae plants, with C. pulverulentus and C. indica as outgroups (Figure 8). Both the ML and MP phylogenetic trees strongly indicated that K. galanga and K. elegans formed a cluster within Zingiberaceae and that C. pulverulentus and C. indica were clearly separated from the Zingiberaceae species (Figure 8). The nine Zingiberaceae species were grouped into four clusters. The first cluster comprised the genus Kaempferia (K. galanga and K. elegans). The second cluster comprised two genera, Zingiber and Curcuma (Z. spectabile, C. flaviflora and C. roscoeana). The third cluster comprised the genus Amomum (A. kravanh and A. compactum). The fourth cluster comprised the genus Alpinia (A. zerumbet and A. oxyphylla).

Potential RNA Editing Sites
In the present study, potential RNA editing sites were predicted for 34 genes; as a result, a total of 54 and 80 RNA editing sites were identified in the K. galanga and K. elegans chloroplast genomes, respectively (Table S6). No potential editing sites were identified in seven genes (petG, petL, psaB, psaI, psbL, rpl2, rpl23) in either chloroplast genome. In K. galanga, the 54 editing sites occurred in 21 genes, with 15 (27.8%) located at the first codon position and 39 (72.2%) at the second. In K. elegans, the 80 editing sites occurred in 26 genes, with 21 (26.2%) located at the first codon position and 59 (73.8%) at the second. No editing sites were found at the third codon position in either Kaempferia species.

Plant Material and DNA Isolation
Fresh leaves were collected from potted K. galanga and K. elegans plants grown in the greenhouse of the Environmental Horticulture Research Institute, Guangdong Academy of Agricultural Sciences, Guangzhou, China. Total chloroplast DNA was extracted from about 100 g of leaves using the sucrose gradient centrifugation method as improved by Li et al. [21]. The chloroplast DNA concentration of each sample was estimated using an ND-2000 spectrometer (Nanodrop Technologies, Wilmington, DE, USA), and the DNA was examined visually by gel electrophoresis.

Chloroplast Genome Sequencing and Genome Assembly
The chloroplast DNA was first fragmented into 300-500 bp pieces using a Covaris M220 Focused-ultrasonicator (Covaris, Woburn, MA, USA) and used to construct short-insert libraries (insert size about 430 bp) according to the manufacturer's instructions (Illumina, San Diego, CA, USA). The short fragments were sequenced on an Illumina HiSeq X Ten platform (Novogene, Beijing, China). The Illumina raw reads were cleaned by removing adapter sequences and low-quality reads, namely reads with ambiguous nucleotides, reads in which more than 10% of nucleotides had a Q-value ≤ 20, and short reads (length < 50 bp).
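The read-cleaning criteria described above can be illustrated with a minimal Python sketch. It is not the pipeline used in this study: the function name `passes_quality_filter` and its parameters are hypothetical, adapter removal is not shown, and the Phred scores are assumed to be already decoded into integers.

```python
def passes_quality_filter(seq, quals, min_len=50, max_low_frac=0.10, low_q=20):
    """Return True if a read survives the three checks described in the text.

    seq   : read bases as a string
    quals : per-base Phred scores, one integer per base (assumed pre-decoded)
    """
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lengths differ")
    if len(seq) < min_len:                       # short read (length < 50 bp)
        return False
    if "N" in seq.upper():                       # ambiguous nucleotide present
        return False
    low = sum(1 for q in quals if q <= low_q)    # bases with Q-value <= 20
    return low / len(seq) <= max_low_frac        # reject if more than 10% are low quality


# Toy reads (hypothetical data, for illustration only)
good = passes_quality_filter("ACGT" * 25, [35] * 100)
too_short = passes_quality_filter("ACGT" * 10, [35] * 40)
has_n = passes_quality_filter("ACGN" * 25, [35] * 100)
low_quality = passes_quality_filter("ACGT" * 25, [35] * 89 + [10] * 11)
borderline = passes_quality_filter("ACGT" * 25, [35] * 90 + [20] * 10)
```

In this sketch a read with exactly 10% low-quality bases is retained, because the text specifies removal only when the fraction exceeds 10%.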
The chloroplast DNA was also fragmented into 8-10 kb fragments, which were sequenced following the standard protocol for the PacBio platform (Novogene, Beijing, China). The PacBio raw reads were pre-processed by trimming adapter sequences and removing low-quality reads (Q < 0.80), short reads (length < 100 bp) and short subreads (length < 500 bp).
Initially, the Illumina clean reads were assembled into principal contigs using SOAPdenovo (version 2.04, Hongkong, China) with default parameters [22], and all contigs were sorted and joined into a single draft sequence using Geneious version 11.0.4 (Auckland, New Zealand) [23]. Next, BLASR software (San Diego, CA, USA) was used to align the PacBio clean data to the draft sequence and to correct errors in the PacBio reads [24]. The corrected PacBio clean data were then assembled using Celera Assembler (version 8.0, Rockville, MD, USA) with default parameters, thus generating scaffolds [25]. Gaps in the assembled scaffolds were then closed with the Illumina clean reads using GapCloser (version 1.12, Hongkong, China) [22]. Finally, redundant fragment sequences were removed, generating the final assembled chloroplast genome sequence.

Chloroplast Genome Annotation and Codon Usage
The initial gene annotation of the chloroplast genome was carried out with BLAST homology searches and DOGMA (Dual Organellar Genome Annotator) [26]. tRNA genes were identified using tRNAscan-SE with default settings [27]. The gene homologies were confirmed by comparison with the National Center for Biotechnology Information (NCBI) non-redundant (Nr) protein database, the Clusters of Orthologous Groups (COG) for eukaryotic complete genomes database (http://www.ncbi.nlm.nih.gov/COG), Kyoto Encyclopedia of Genes and Genomes (KEGG) (http://www.kegg.jp/), Gene Ontology (GO) (http://www.geneontology.org) and SWISS-PROT (http://web.expasy.org/docs/swissprotguideline.html) databases. The chloroplast genome maps were drawn using OGDRAW v1.2 (Potsdam-Golm, Germany) [28]. Codon usage was determined for all protein-coding genes. To examine the deviation in synonymous codon usage, the relative synonymous codon usage (RSCU) was calculated using MEGA6 software (Version 6.0, Jeddah, Saudi Arabia) [29]. Amino acid (AA) frequency was also calculated and expressed as the percentage of codons encoding the same amino acid relative to the total number of codons. The final chloroplast genome sequences have been submitted to GenBank under accession numbers MK209001 and MK209002 for K. galanga and K. elegans, respectively.
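For readers unfamiliar with RSCU, the following Python sketch shows the usual definition: the observed count of a codon divided by the count expected if all synonymous codons for that amino acid were used equally. It is an illustration only and does not reproduce MEGA6. The function name `rscu` is hypothetical, and the codon-to-amino-acid mapping is the standard genetic code, which is assumed here to be appropriate for chloroplast protein-coding genes.

```python
from collections import Counter
from itertools import product

# Standard genetic code in the conventional T, C, A, G ordering
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def rscu(cds_list):
    """Compute RSCU for every codon of each amino acid observed in the input.

    cds_list : iterable of in-frame coding sequences (DNA strings)
    Returns a dict mapping codon -> RSCU value.
    """
    counts = Counter()
    for cds in cds_list:
        cds = cds.upper()
        for i in range(0, len(cds) - len(cds) % 3, 3):
            codon = cds[i:i + 3]
            if codon in CODON_TABLE:             # skip codons with ambiguous bases
                counts[codon] += 1

    # Group synonymous codons by the amino acid (or stop) they encode
    families = {}
    for codon, aa in CODON_TABLE.items():
        families.setdefault(aa, []).append(codon)

    result = {}
    for codons in families.values():
        total = sum(counts[c] for c in codons)
        if total == 0:
            continue                             # amino acid not observed
        expected = total / len(codons)           # equal-usage expectation
        for c in codons:
            result[c] = counts[c] / expected
    return result


# Toy coding sequence (hypothetical): ATG GCT GCT GCC GCA TAA
example = rscu(["ATGGCTGCTGCCGCATAA"])
```

An RSCU above 1 indicates that a codon is used more often than expected under equal usage, and a value below 1 indicates the opposite.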

SSRs and Long Repeat Structure
SSRs were identified using MIcroSAtellite (MISA) [30]. The parameters were set to identify perfect mono-, di-, tri-, tetra-, penta- and hexanucleotide motifs with a minimum of 8, 5, 4, 3, 3 and 3 repeats, respectively. The online REPuter software was used to identify and locate forward, palindrome, reverse and complement repeat sequences with repeat sizes ≥30 bp and sequence identity ≥90% [31].
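The repeat thresholds above can be made concrete with a small regular-expression sketch in Python. This is not MISA itself: the names `MIN_REPEATS`, `is_primitive` and `find_ssrs` are hypothetical, only perfect repeats are reported, and compound SSRs are not merged.

```python
import re

# Minimum repeat number per unit length, as stated in the text
MIN_REPEATS = {1: 8, 2: 5, 3: 4, 4: 3, 5: 3, 6: 3}


def is_primitive(motif):
    """True if the motif is not itself a repetition of a shorter unit (e.g. 'ATAT')."""
    n = len(motif)
    return not any(n % k == 0 and motif == motif[:k] * (n // k) for k in range(1, n))


def find_ssrs(seq):
    """Return sorted (start, end, motif, repeat_count) tuples with 1-based coordinates."""
    seq = seq.upper()
    hits = []
    for unit, min_rep in MIN_REPEATS.items():
        # One captured unit followed by at least (min_rep - 1) further copies
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit, min_rep - 1))
        pos = 0
        while True:
            m = pattern.search(seq, pos)
            if m is None:
                break
            if is_primitive(m.group(1)):
                hits.append((m.start() + 1, m.end(), m.group(1),
                             (m.end() - m.start()) // unit))
                pos = m.end()
            else:
                # e.g. 'AA' x 5 is a mononucleotide run, not a dinucleotide SSR
                pos = m.start() + 1
    return sorted(hits)


# Toy sequence (hypothetical): A x 10, AT x 6 and AAAT x 3 separated by spacers
toy = "GC" + "A" * 10 + "CGCG" + "AT" * 6 + "GGC" + "AAAT" * 3 + "C"
ssrs = find_ssrs(toy)
```

The primitive-motif check prevents a homopolymer run from being counted again as a di- or tetranucleotide repeat.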

Comparative and Divergence Analysis of Chloroplast Genomes of K. galanga and K. elegans
The complete chloroplast genome of K. galanga was employed as a reference and was compared with the chloroplast genomes of K. elegans, Alpinia zerumbet (JX088668), Curcuma flaviflora (KR967361) and Zingiber spectabile (JX088661), the last three of which were obtained from GenBank, using mVISTA program (http://genome.lbl.gov/vista/mvista/about.shtml) in the Shuffle-LAGAN mode [32]. To calculate nucleotide variability (Pi) between K. galanga and K. elegans chloroplast genomes, sliding window analysis was performed using DnaSP version 5.1 software [33] with window length of 600 bp and the step size of 200 bp.
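The sliding-window calculation can be sketched in Python as follows, assuming the sequences are already aligned to equal length. This is an illustration, not DnaSP: the function name `window_pi` is hypothetical, Pi is taken as the mean proportion of pairwise differences per site, and skipping columns that contain gaps or ambiguous bases is an assumption of this sketch.

```python
from itertools import combinations


def window_pi(aligned, window=600, step=200):
    """Nucleotide diversity (Pi) in sliding windows over an alignment.

    aligned : list of equal-length aligned sequences ('-' marks a gap)
    Returns a list of (start, end, pi) with 1-based inclusive coordinates;
    pi is None when a window has no usable column.
    """
    aligned = [s.upper() for s in aligned]
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("sequences must be aligned to the same length")

    pairs = list(combinations(range(len(aligned)), 2))
    results = []
    for start in range(0, max(length - window, 0) + 1, step):
        end = min(start + window, length)
        # Keep only columns where every sequence has an unambiguous base
        cols = [i for i in range(start, end)
                if all(s[i] in "ACGT" for s in aligned)]
        if not cols:
            results.append((start + 1, end, None))
            continue
        diffs = sum(1 for a, b in pairs for i in cols
                    if aligned[a][i] != aligned[b][i])
        results.append((start + 1, end, diffs / (len(pairs) * len(cols))))
    return results


# Toy alignment (hypothetical) with a small window so the output is easy to follow
toy_pi = window_pi(["ACGTACGTAC", "ACGAACGTTC"], window=4, step=2)
```

With two sequences, as in the K. galanga versus K. elegans comparison, Pi in each window reduces to the fraction of compared sites that differ.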
The complete chloroplast sequences of K. galanga and K. elegans were also aligned using MUMmer software (Maryland, USA) [34] and adjusted manually where necessary using Se-Al 2.0 [35]. The single nucleotide polymorphisms (SNPs) and insertion/deletions (indels) were recorded separately as well as their locations in the chloroplast genome.
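How SNPs and indels can be recorded separately from a pairwise alignment is sketched below in Python. It does not reproduce MUMmer. The function name `call_variants` is hypothetical, the input is assumed to be two gapped, equal-length aligned strings, and a run of consecutive gap columns is counted as a single indel event.

```python
def call_variants(ref_aln, qry_aln):
    """Separate SNPs from indels in a gapped pairwise alignment.

    Returns (snps, indels):
      snps   : list of (ref_position, ref_base, query_base)
      indels : list of (kind, ref_position, length), where an insertion is
               reported at the reference base preceding it and a deletion
               at the first deleted reference base (1-based).
    """
    if len(ref_aln) != len(qry_aln):
        raise ValueError("aligned strings must have equal length")
    snps, indels = [], []
    ref_pos = 0                      # reference bases consumed so far
    i, n = 0, len(ref_aln)
    while i < n:
        r, q = ref_aln[i], qry_aln[i]
        if r == "-" and q == "-":    # uninformative column
            i += 1
        elif r == "-":               # bases present only in the query
            j = i
            while j < n and ref_aln[j] == "-" and qry_aln[j] != "-":
                j += 1
            indels.append(("insertion", ref_pos, j - i))
            i = j
        elif q == "-":               # bases present only in the reference
            j = i
            while j < n and qry_aln[j] == "-" and ref_aln[j] != "-":
                j += 1
            indels.append(("deletion", ref_pos + 1, j - i))
            ref_pos += j - i
            i = j
        else:
            ref_pos += 1
            if r != q:
                snps.append((ref_pos, r, q))
            i += 1
    return snps, indels


# Toy alignment (hypothetical): one SNP, one 2 bp insertion, one 2 bp deletion
toy_snps, toy_indels = call_variants("ACGT--ACGTAC", "ACCTGGAC--AC")
```

Insertions and deletions are defined here relative to the reference sequence, which in this study was K. galanga.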

Phylogenetic Analysis
A molecular phylogenetic tree was constructed using SNPs from 11 species, including K. galanga and K. elegans. For nine of these species, complete chloroplast genome sequences were downloaded from NCBI: A. zerumbet (JX088668), C. flaviflora (KR967361), Z. spectabile (JX088661), C. roscoeana (NC_022928.1), Alpinia oxyphylla (NC_035895.1), Amomum kravanh (NC_036935.1), Amomum compactum (NC_036992.1), Costus pulverulentus (KF601573) and Canna indica (KF601570). The K. galanga chloroplast genome was used as the reference. Costus pulverulentus and Canna indica were set as outgroups to the family Zingiberaceae. First, using MUMmer software [34], each chloroplast genome was compared globally with the reference genome, the differences between each genome and the reference were identified, and preliminary filtering was performed to detect potential SNP sites. Second, the 100 bp on each side of each SNP site in the reference sequence were extracted, and the extracted sequence was compared with the assembly using BLAT [36,37] to verify the SNP site. If the alignment was shorter than 101 bp, the SNP was considered untrustworthy and was removed; if the extracted sequence aligned at multiple locations, the SNP was considered to lie in a duplicated region and was also removed. The remaining SNPs were regarded as reliable. Third, for each chloroplast genome, all SNPs were concatenated in the same order to obtain a sequence in FASTA format. The resulting FASTA sequences were aligned using ClustalX version 1.81 [38]. To examine the phylogenetic utility of rapidly evolving SNP markers, phylogenetic trees were constructed in MEGA6 [29] using the maximum likelihood (ML) and maximum parsimony (MP) methods, each with 1000 bootstrap replicates.
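The third step, concatenating the SNPs of each genome in the same order, might look like the following Python sketch. The function name `snp_matrix_to_fasta` and the input layout are hypothetical, and the sketch assumes that a genome with no SNP call at a retained site carries the reference base there; the text does not state how such sites were treated.

```python
def snp_matrix_to_fasta(ref_alleles, calls):
    """Concatenate per-species alleles at the retained SNP sites into FASTA text.

    ref_alleles : {reference_position: reference_base} for all retained SNP sites
    calls       : {species_name: {reference_position: alternative_base}}
    """
    positions = sorted(ref_alleles)              # same site order for every species
    records = []
    for name, alt in calls.items():
        # Assumption of this sketch: no call at a site means the reference base
        seq = "".join(alt.get(p, ref_alleles[p]) for p in positions)
        records.append(">%s\n%s" % (name, seq))
    return "\n".join(records) + "\n"


# Toy SNP table (hypothetical positions and alleles)
toy_ref = {120: "A", 45: "G", 300: "T"}
toy_calls = {"K_galanga": {}, "K_elegans": {45: "A", 300: "C"}}
toy_fasta = snp_matrix_to_fasta(toy_ref, toy_calls)
```

Because every sequence is built from the same ordered list of sites, the resulting FASTA records have equal length and can be passed directly to an alignment or tree-building program.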

RNA Editing Analysis
Thirty-four protein-coding genes of K. galanga and K. elegans chloroplast genomes were used to predict potential RNA editing sites using the online program Predictive RNA Editor for Plants (PREP) suite (http://prep.unl.edu/) with a cutoff value of 0.8 (Bielefeld, Germany) [39].

Discussion
In this study, we obtained the complete chloroplast genomes of K. galanga and K. elegans by Illumina and PacBio sequencing; they ranged from 163.5 to 163.8 kb in length. Both chloroplast genomes exhibit a typical quadripartite structure, as reported for other Zingiberaceae species, such as A. oxyphylla, A. zerumbet, C. flaviflora, Z. spectabile, C. roscoeana, A. compactum and A. kravanh [12]. The K. galanga and K. elegans genomes encode 111 and 113 genes, respectively, including 79 protein-coding genes and 4 rRNA genes in each, as well as 28 and 30 tRNA genes, respectively. The number of protein-coding genes is consistent with that found in other Zingiberaceae members [12].
Besides the highly variable regions, we were able to retrieve SSRs and long repeats. Of the 240 SSRs identified in K. galanga, 64.58% (155 SSRs) were located in the LSC region, 16.66% (40 SSRs) in the SSC region and 18.75% (45 SSRs) in the IR regions. Likewise, of the 248 SSRs identified in K. elegans, 64.11% (159 SSRs) were present in the LSC region, 16.93% (42 SSRs) in the SSC region and the remaining 19.75% (49 SSRs) in the IR regions, a pattern also reported in other plants such as A. kravanh [12], Talinum paniculatum [20] and Oryza minuta [45]. Among the SSR types, mononucleotides were the most abundant in both K. galanga and K. elegans (Figure 3). These findings agree with results from previous studies in A. kravanh [12], T. paniculatum [20] and buckwheat species [43] but differ from those in O. minuta, in which most SSRs had dinucleotide repeat motifs [45]. AT/AT (12.5%) was the most frequent dinucleotide motif, followed by AAAT/ATTT, in both K. galanga and K. elegans (Figure 3). The SSRs and long repeats identified in our study could be useful in molecular studies, such as analyses of genetic diversity and phylogenetic relationships, species identification and evolutionary studies [7,12,21,40,43].
In the present study, we also identified 536 SNPs (nucleotide substitutions) and 107 indels between the chloroplast genomes of the two Kaempferia species (Tables S4 and S5). The number of nucleotide substitutions between the two Kaempferia chloroplast genomes is higher than that reported between species of Oryza, Machilus, cultivated Fagopyrum, Citrus and Panax but lower than that between species of Solanum and wild Fagopyrum. Specifically, comparative analyses of chloroplast genomes found 159 SNPs between Oryza nivara and O. sativa [46], 231 SNPs between Machilus yunnanensis and M. balansae [41], 317 SNPs between the cultivated species Fagopyrum dibotrys and F. tataricum [43], 330 SNPs between Citrus sinensis and C. aurantiifolia [44], 464 SNPs between Panax notoginseng and P. ginseng [47], 591 SNPs between Solanum tuberosum and S. bulbocastanum [48], and 6260 SNPs between the wild species F. luojishanense and F. esculentum [43]. Of the 107 indels found between the K. galanga and K. elegans chloroplast genomes, the two longest were located in the intergenic regions petA-psbJ and atpH-atpI. The petA-psbJ and partial psbA-trnH spacer sequences can be used for species identification of most Kaempferia and outgroup species [4]. Similarly, a single large 241-bp deletion in S. tuberosum clearly discriminated a cultivated potato from the wild potato species S. bulbocastanum [48]. Indels and SNPs in the chloroplast genomes of 12 Triticeae species have been used to infer the evolution of wheat, barley, rye and their relatives [49]. The indels and SNPs found in our study could likewise be useful in phylogenetic analysis, species identification and evolutionary studies, like the 65 indels detected between M. yunnanensis and M. balansae [41], the 156 indels between P. notoginseng and P. ginseng [47] and the 53 indels between Aconitum pseudolaeve and A. longecassidatum [50].
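To make the distinction between SNPs and indel events concrete, the following Python sketch counts both from a pairwise alignment. It is an illustration under stated assumptions rather than the pipeline used in this study: the function name `count_variants` is hypothetical, the input is assumed to be two already aligned sequences with `-` marking gaps, and a run of consecutive gap columns in one sequence is counted as a single indel event.

```python
def count_variants(aln_a, aln_b):
    """Return (snps, indels) for two aligned sequences of equal length.

    snps   : list of (column, base_in_a, base_in_b)
    indels : list of (start_column, length, gapped_sequence), where
             gapped_sequence is 'a' or 'b' (the sequence carrying the gap).
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")

    snps = []
    indels = []
    run = None  # open indel event: [start_column, length, gapped_sequence]

    for col, (a, b) in enumerate(zip(aln_a.upper(), aln_b.upper())):
        if a == "-" and b == "-":
            side = None          # uninformative column; closes any open run
        elif a == "-":
            side = "a"
        elif b == "-":
            side = "b"
        else:
            side = None
            if a != b:           # note: ambiguity codes such as N are not filtered
                snps.append((col, a, b))

        if run is not None and run[2] == side:
            run[1] += 1          # extend the current indel event
            continue
        if run is not None:
            indels.append(tuple(run))
            run = None
        if side is not None:
            run = [col, 1, side]

    if run is not None:
        indels.append(tuple(run))
    return snps, indels


if __name__ == "__main__":
    # Toy alignment, not real Kaempferia data.
    snps, indels = count_variants("ACGT--ACGTTA", "ACCTGGAC--TA")
    print(len(snps), "SNPs;", len(indels), "indel events")
    print(max(indels, key=lambda event: event[1]))  # longest indel event
```

Sorting the indel list by length, as in the last line, is one simple way to pick out the longest events, analogous to the petA-psbJ and atpH-atpI indels highlighted above.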
Chloroplast genome sequences provide a useful genomic resource for phylogenetic studies, and many studies have successfully used protein-coding sequences or whole chloroplast genome sequences in such analyses [7,12,20,43]. Specifically, the chloroplast psbA-trnH and partial petA-psbJ sequences and the matK gene have previously been used in phylogenetic studies of Zingiberaceae species [4,51]. In this study, we constructed phylogenetic trees using ML and MP methods based on SNPs shared among the chloroplast genomes of eleven species, including the two Kaempferia species from the current study. Our phylogenetic analysis clearly showed that the two Kaempferia species clustered together with a bootstrap value of 100%, as did the Amomum and Alpinia species, which segregated into two sister clades (Figure 8). In a previous study, a phylogenetic tree constructed from whole chloroplast genome sequences strongly supported the position of A. kravanh within Zingiberaceae as a sister of the closely related species A. zerumbet [12]. Our SNP-based phylogenetic trees were in broad agreement with that study [12]. Therefore, phylogenetic analysis using SNPs among chloroplast genome sequences could provide useful information for revealing relationships among Zingiberaceae species.
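One way to picture "SNPs commonly present" in all genomes is as the variable columns of a multiple sequence alignment in which every species has an unambiguous base. The Python sketch below extracts such columns into a SNP-only matrix that a tree-building program could take as input. It is a simplified illustration, not the procedure used in this study: the function name `snp_matrix`, the taxon labels and the toy sequences are hypothetical, and tree inference itself (ML or MP) is left to dedicated software.

```python
def snp_matrix(alignment):
    """Extract SNP columns shared by all taxa from a multiple alignment.

    alignment : dict mapping taxon name -> aligned sequence (equal lengths)
    Returns (columns, matrix), where columns lists the retained alignment
    columns and matrix maps each taxon to its concatenated SNP bases.
    """
    names = list(alignment)
    seqs = [alignment[name].upper() for name in names]
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("all aligned sequences must have the same length")

    bases = set("ACGT")
    columns = []
    for col in range(len(seqs[0])):
        observed = {s[col] for s in seqs}
        # Keep a column only if every taxon has A/C/G/T (no gaps or
        # ambiguity codes) and at least two different bases occur.
        if observed <= bases and len(observed) > 1:
            columns.append(col)

    matrix = {
        name: "".join(s[col] for col in columns)
        for name, s in zip(names, seqs)
    }
    return columns, matrix


if __name__ == "__main__":
    # Toy alignment with hypothetical taxon labels, not real data.
    toy = {
        "taxon_1": "ACGTAC-T",
        "taxon_2": "ATGTACGT",
        "taxon_3": "ACGAANGT",
    }
    columns, matrix = snp_matrix(toy)
    print(columns)
    for name, snps in matrix.items():
        print(name, snps)
```

Dropping columns with gaps or ambiguous bases is a conservative design choice: it keeps only positions that can be compared across every species, at the cost of discarding sites that are informative for a subset of taxa.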
In conclusion, we assembled and analyzed the complete chloroplast genomes of K. galanga and K. elegans and compared them with those of other Zingiberaceae species for the first time. The chloroplast genome organization, gene order, GC content and codon usage of the two Kaempferia species showed high similarity. The location and distribution of SSRs and long repeat sequences were determined. Eight highly variable regions between the two Kaempferia species were identified, and 643 mutation events, including 536 SNPs and 107 indels, were accurately located. Sequence divergences of the chloroplast genomes were also calculated for the two Kaempferia species and related Zingiberaceae species. The phylogenetic analysis based on SNPs among eleven species strongly supported that K. galanga and K. elegans form a cluster within Zingiberaceae. Our results provide insights into the characteristics of the entire K. galanga and K. elegans chloroplast genomes and the phylogenetic relationships within Zingiberaceae species.

Conflicts of Interest:
The authors declare that they have no conflicts of interest.