Inferring the Phylogeny and Divergence of Chinese Curcuma (Zingiberaceae) in the Hengduan Mountains of the Qinghai–Tibet Plateau by Reduced Representation Sequencing

Clarifying the genetic relationship and divergence among Curcuma L. (Zingiberaceae) species around the world is intractable, especially among the species located in China. In this study, Reduced Representation Sequencing (RRS), as one of the next generation sequences, has been applied to infer large scale genotyping of major Chinese Curcuma species which present little differentiation of morphological characteristics and genetic traits. The 1295 high-quality SNPs (reduced-filtered SNPs) were chosen from 997,988 SNPs of which were detected from the cleaned 437,061 loci by RRS to investigate the phylogeny and divergence among eight major Curcuma species locate in the Hengduan Mountains of the Qinghai–Tibet Plateau (QTP) in China. The results showed that all the population individuals were clustered together within species, and species were obviously separated; the clustering results were recovered in PCA (Principal Component Analysis); the phylogeny was (((((C. Phaeocaulis, C. yunnanensis), C. kwangsiensis), (C. amarissima, C. sichuanensis)), C. longa), (C. wenyujin, C. aromatica)); Curcuma in China originated around ~7.45 Mya (Million years ago) in the Miocene, and interspecific divergence appeared at ca. 4–2 Mya, which might be sped up rapidly along with the third intense uplift of QTP.


Introduction
Curcuma, commonly known as turmeric, has been an important flowering plant genus with medicinal, edible, and horticultural utilizations in the Orient from ancient times. Usually Curcuma are distributed in the subtropical and tropical area of Asia [1][2][3], and there are about 10 species in China (C. longa, C. sichuanensis, C. amarissima, C. yunnanensis, C. phaeocaulis, C. kwangsiensis, C. aromatica, C. wenyujin, C. flaviflora, C. exigua, and C. viridiflora) [4]. Among them, C. viridiflora and C. exigua may be extinct [5], C. yunnanensis and C. flaviflora are scarce and difficult to collect [4]. The worldwide Curcuma represent a paraphyletic group involving Hitchenia, Stahlianthus, and Smithatris in Zingibereae according to matK and ITS [6], and it has been classified into four lineages (Hitcheniopsis, Pierreana, Curcuma, and Ecomata) by using ITS and three chloroplast regions (trnL-trnF, psbA-trnH, matK) [2]. Recently, partial lineages in Curcuma have been clarified, though the phylogeny Table 1. Sample of species, with source details and total loci after processing with ipyRAD, of the 61 individuals used in phylogenetic analyses. Each sample has a sample name, species name, collecting information, and loci information. The genomic DNA isolation was carried out on fresh leaves by the CTAB method [48]. The RRS libraries for each sample were prepared using the protocol outline as previously described [49]. We used the restriction enzyme PstI (CTGCAG) to digest the extracted genomic DNA from each individual, and then ligated the resulting fragments to a barcode adaptor and a common adaptor with the correct sticky ends. Then, a Qiagen MinElute 96well PCR purification kit was used in the clean-up step to clean up the products. After PCR, the PicoGreen and a qPCR machine were used to examine the quality of the PCR products. All individuals were pooled into a single RRS library. Sample sequencing was done on Illumina Hiseq PE150 sequencer in Genepioneer Biotechnologies Co. Ltd., Nanjing, China. Raw Reads of the founder lines were deposited in the National Center for Biotechnology Information (NCBI) BioProject ID: PRJNA557061.

Clustering
The software of pipeline ipyRAD 0.7.29 [50] was used to process the raw data from the Illumina FASTQ files. The pipeline is focusing on preparing RADseq type data for population level analyses [51]. Following seven sequential steps, the ipyRAD pipeline can obtain species or higher variation across clades in clustering and alignment method based on specific parameters in the ipyRAD documentation (https://github.com/dereneaton/ipyrad (accessed on 21 January 2019)). The ipyRAD standard parameter settings were as follows: Nucleotides with Phred scores of <20 were coded as unknown bases (N), and sequences with >5% N's were thrown out. Sequences were clustered within individuals by 90% similarity via the uclust function in USEARCH [52]. Clusters of less than 10 sequences were discarded and the minimum number of individuals per cluster was set to 5. Heterozygous loci among more than two individuals were discarded. The remaining clusters were treated as loci and assembled into a phylogenetic matrix.

Concatenation-Based Species Tree Inference
The reduced-filtered SNPs were acquired from the filtered SNPs (QC) via quality control by using PLINK 1.9 with standard parameter settings. The standard parameter settings were as follows: Missingness per marker was set to 0.05. Minor allele frequency was set to 0.05. The model of substitution for data was run in MrModeltest [53] by the Akaike Information Criterion (AIC) and obtained the best model of GTR. The maximum likelihood (ML) was implemented under the GTR nucleotide substitution model in RAxML8.2.8 [54]. The maximum parsimony (MP) tree branch with booststrap support was done in the software PAUP * 4.0a134. Optimal MP trees were searched by a heuristic strategy with 1000 random sequence additions and TBR branch swapping. Bootstrap values were calculated using 1000 replicates, 10 random additions per replicate, and TBR branch swapping. Bayesian inference (BI) tree was carried out using MrBayes 3.1.2 [55]. Markov chain Monte Carlo (MCMC) searches were started from a random tree and run for 3,000,000 generations, where the topologies were sampled every 100 generations. Furthermore, 25% of our individuals (which the first 2500 trees) were discarded as burnin. The Bayesian posterior probabilities of the nodal supports were inferred and the 50% majority-rule consensus tree was constructed based on the rest of trees.

Population Structure and Divergence Time Inference
Principal component analysis (PCA) is a purely mathematical method that reflects the clustering between groups and is based on the degree of SNPs in different individuals by EIGENSOFT version 7.2.1 [56]. Population structure of 60 Curcuma individuals was obtained by ADMIXTURE 1.3.0 [57]. We predefined the ancestral proportions (K) from 3 to 12, and ran the cross-validation error (CV) procedure. The default settings and methods were used for other parameters.
On the basis of the high-quality SNPs, the BEAST version 2.5.0 was used to estimate Chinese Curcuma species divergence time, and the Bayesian tree was dated by setting divergence time between Hedychium coronarium and Curcuma as 42.4 Mya (37.4-47.4 Mya) [58]. The GTR model for nucleotide substitution and the "Bayesian skyline" tree prior model was confirmed with a standard normal prior. Substitution model and site heterogeneity model were used the optimal model based on the Bayesian analysis to select the model "relaxed clock" and MCMC runs 200 million generations, and every 1000 steps in the individuals were to ensure effective sample size (ESS) in each parameter greater than 100. The output file assessed convergence in Tracer 1.5. The phylogenetic tree used TreeAnnotator 1.5.3 to discard 25% as burnin. Finally, the divergence time was analyzed and obtained in FigTree 1.1.2 [59].

Sequences Discovery and Characterization
A total of 53.15 Gb raw data were produced by RRS (Raw Reads in NCBI BioProject ID: PRJNA557061). Then, 437,061 unique RRS loci across all the individuals were revealed by using the denove clustering method in ipyRAD. The reduced-filtered SNPs (1295 bp) were obtained under quality control (QC) in filtered SNPs by using Plink 1.9 to analyze the phylogeny, evolution, and divergence.

Phylogenetic Analyses
The MP, ML, and BI trees were obtained by reduced-filtered SNPs with strong support values (Figure 1), and the same clustering results were recovered in PCA ( Figure 2). The phylogeny shown respectively in MP, ML, and BI tree were consistent in this study.  To test the evolution of Chinese Curcuma, a Bayesian clustering algorithm with admixed models was used to estimate the ancestral proportions (K) for each specimen (Figure 3). Based on CV error, the K = 12 represented the best model for these 60 samples. When K = 3, the eight species belong to three gene pools (blue gene pools, green gene pools, and red gene pools). C. longa had independent gene pools (blue). The species of C. amarissima, C. sichuanensis, C. kwangsiensis, C. yunnanensis, and C. phaeocaulis shared green gene pools. The red gene pools were shared by C. wenyujin and C. aromatica; When K = 4 to 12, C. longa could be distinguished by constitution of gene pool, and Group I (C. wenyujin and C. aromatica) is close to Group IV (C. kwangsiensis, C. yunnanensis, and C. phaeocaulis).

Divergence Time Inference
Curcuma occurred in the Miocene (~7.45 Mya) in China (Figure 4). C. longa appeared around 6.43 Mya, the intraspecific of C. longa diversified at ~2.45 Mya from the late Pliocence to Quaternary. Group I (C. amarissima and C. sichuanensis) and II (C. kwangsiensis, C. To test the evolution of Chinese Curcuma, a Bayesian clustering algorithm with admixed models was used to estimate the ancestral proportions (K) for each specimen ( Figure 3). Based on CV error, the K = 12 represented the best model for these 60 samples. When K = 3, the eight species belong to three gene pools (blue gene pools, green gene pools, and red gene pools). C. longa had independent gene pools (blue). The species of C. amarissima, C. sichuanensis, C. kwangsiensis, C. yunnanensis, and C. phaeocaulis shared green gene pools. The red gene pools were shared by C. wenyujin and C. aromatica; When K = 4 to 12, C. longa could be distinguished by constitution of gene pool, and Group I (C. wenyujin and C. aromatica) is close to Group IV (C. kwangsiensis, C. yunnanensis, and C. phaeocaulis).

Divergence Time Inference
Curcuma

Application of "Next-Generation" Sequencing in Estimating the Phylogeny and Biological Implication of Curcuma
The recent rapid speciation of species might lead to a mistake in inference by using a single phylogenetic tree, and this might be presented a poor fit in the multi-species coalescent supermatrix data and model [60,61]. A large amount of sequence data is ideal for establishing and reconstructing phylogenetic trees [5,28,41,[62][63][64]. In this study, RRS was firstly used to produce big data to analyze the phylogeny and divergence of Chinese Curcuma. More loci and variable sites might improve the phylogeny reconstruction of Chinese Curcuma [5,41]. Compared to limited individuals and data size introduced in some previous studies, more individuals were involved to produce big data for mining more accurate information to recover a well-resolved phylogeny in this study [31].

Phylogeny Inference
The phylogeny of Curcuma species is difficult to resolve by depending on traditional approaches for their complicated origins (hybridization, introgression, and common appearance in species) [2,8,17]. The genetic relationship among Chinese Curcuma species is very close in this study, which is consistent with several previous studies [1,3,7].
Studies based on various morphological evidences reveal different results and lead to unreliable classifications of Curcuma: Based on pollen grains, Chen and Xia believed C. aromatica, C. yunnanensis, and C. longa had a close relationship, and the relationships among C. amarrisima, C. elata, C. flaviflora, C. phaecaulis, C. sichuanensis, and C. wenyujin were very close and hard to distinguish [16]. These species in Xiao et al. were delineated as Group I: C. longa, C. xanthorrhiza, and C. sichuanensis, Group II: C. kwangsiensis and C. exigua, Group III: C. wenyujin, C. aromatica, C. phaeocaulis C. zedoaria, and C. yunnanensis by

Application of "Next-Generation" Sequencing in Estimating the Phylogeny and Biological Implication of Curcuma
The recent rapid speciation of species might lead to a mistake in inference by using a single phylogenetic tree, and this might be presented a poor fit in the multi-species coalescent supermatrix data and model [60,61]. A large amount of sequence data is ideal for establishing and reconstructing phylogenetic trees [5,28,41,[62][63][64]. In this study, RRS was firstly used to produce big data to analyze the phylogeny and divergence of Chinese Curcuma. More loci and variable sites might improve the phylogeny reconstruction of Chinese Curcuma [5,41]. Compared to limited individuals and data size introduced in some previous studies, more individuals were involved to produce big data for mining more accurate information to recover a well-resolved phylogeny in this study [31].

Phylogeny Inference
The phylogeny of Curcuma species is difficult to resolve by depending on traditional approaches for their complicated origins (hybridization, introgression, and common appearance in species) [2,8,17]. The genetic relationship among Chinese Curcuma species is very close in this study, which is consistent with several previous studies [1,3,7].
Studies based on various morphological evidences reveal different results and lead to unreliable classifications of Curcuma: Based on pollen grains, Chen and Xia believed C. aromatica, C. yunnanensis, and C. longa had a close relationship, and the relationships among C. amarrisima, C. elata, C. flaviflora, C. phaecaulis, C. sichuanensis, and C. wenyujin were very close and hard to distinguish [16]. These species in Xiao et al. were delineated as Group I: C. longa, C. xanthorrhiza, and C. sichuanensis, Group II: C. kwangsiensis and C. exigua, Group III: C. wenyujin, C. aromatica, C. phaeocaulis C. zedoaria, and C. yunnanensis by using oil cells and vascular bundles [14]; and the morphology of leaves produced different results (Group I: C. longa, C. xanthorrhiza, C. wenyujin, and C. sichuanensis; Group II: C. kwangsiensis and C. exigua; and Group III: C. aromatica, C. chuanyujin, C. zedoaria, C. phaeocaulis, and C. yunnanensis) [15].

Divergence Time
The estimates of divergence indicated that the earliest appearance of Curcuma in China was in the Miocene. In the Miocene, large-scale orogenesis and geological events frequently emerged and influenced the speciation of plants living in QTP area during that times [68]. During the time, the drought climate (the Asian Monsoon) was related to intense uplift of QTP [69,70]. Curcuma lies dormant in winter, rhizomes fleshy with tuber-bearing roots, and blooms in the rainy season, which is similar to the drought-resistant plants to satisfy such drought climate [17]. The interspecific divergence of Chinese Curcuma occurred in ca. 4-2 Mya, which was coincided with the drought-resistant and deciduous plant sprang up and expanded rapidly in Hengduan Mountains ca. 4-2 Mya [71]. Chinese Curcuma expanded rapidly during this period, and the third intense uplift of QTP sped up their interspecific divergence.
In addition, hybridization and polyploidy play an important role on speciation and diversification [72,73]. According to previous studies on Curcuma, the distribution range overlapped, and the introgressive hybridization existed extensively [74]. Several Chinese Curcuma species were most likely to be of hybrid origin (C. aromatica and C. kwangsiensis) [8]. The cytomixis and chromosome doubling could promote the production of a large number of new phenotypes in a short time, and polyploids tend to be more adaptive than their parents [75][76][77]. Hybridization and polyploidization in plants are commonly distributed in the QTP regions, where the species overlap generally and the environment vary frequently, to satisfy the harsh environment of the QTP regions [78,79]. Chinese Curcuma species are confirmed to be polyploidy, which might help to adapt the harsh living environment at that time [27,74].

Future Directions
The phylogeny of major Chinese Curcuma species was improved in this study. Chinese Curcuma species are only a subset of Curcuma around the world. More species with population samples, as well as species from the genera of Hitchenia, Stahlianthus, and Smithatris, and more data should be involved to analyze the phylogeny, evolution, and diversification of Curcuma [6].
In addition, Curcuma has a complex evolutionary history; hence, more new methods such as ddRRS, ddRad, and transcriptome should be used to ensure the reliability of data for the phylogeny, evolution, and diversification reconstruction on such recent radiated genera [80].

Conclusions
The RRS was firstly involved to improve the phylogeny and divergence of Chinese Curcuma. The third intense uplift of QTP might speed up the interspecific divergence of Curcuma in China. Overall, this study provides valuable information on the origin of Chinese Curcuma.