Employing Genome-Wide SNP Discovery to Characterize the Genetic Diversity in Cinnamomum camphora Using Genotyping by Sequencing

: Cinnamomum camphora (L.) J.Presl is a representative tree species of evergreen broad-leafed forests in East Asia and has exceptionally high economic, ornamental, and ecological value. However, the excessive exploitation and utilization of C. camphora trees have resulted in the shrinking of wild population sizes and rare germplasm resources. In this study, we characterized 171 C. camphora trees from 39 natural populations distributed throughout the whole of China and one Japanese population. We investigated genetic diversity and population structure using genome-wide single-nucleotide polymorphism (SNP) identiﬁed by genotyping by sequencing (GBS) technology. The results showed the genetic diversity of the C. camphora populations from western China > central China > eastern China. Moreover, the Japanese population showed the highest diversity among all populations. The molecular variance analysis showed 92.03% of the genetic variation within populations. The average pairwise F ST was 0.099, and gene ﬂow N m was 2.718, suggesting a low genetic differentiation among populations. Based on the genetic clustering analysis, the 40 C. camphora populations clustered into three major groups: Western China, Central China, and Eastern China + Japan. Eastern China’s population had the closest genetic relationship with the Japanese population, suggesting possible gene exchange between the two adjacent areas. This study furthers our understanding of the genetic diversity and genetic structure of C. camphora in East Asia and provides genetic tools for developing strategies of C. camphora germplasm utilization.


Introduction
Cinnamomum camphora is a representative tree species of tropical and subtropical evergreen broad-leafed forests of the genus Cinnamomum in the Lauraceae family. It is one of the "bulk" trees with three major uses, including a source of spice, timber, and gardening. C. camphora is widely distributed in subtropical China and central and southern Japan [1,2]. Subtropical China is the most abundant wild resource and the main area of production of natural C. camphora essential oil globally, accounting for more than 80% of the global production [3]. However, since the 1950s, the destructive harvesting of essential oil by digging the trees and grubbing their roots, and the increasing demand for urban vegetation has led to severe reductions of the resources. At present, the natural distribution of C. camphora is scattered, and its mature natural forests are rare, which results in endangered high-quality essential oil containing precious chemical types. Therefore, it is urgent to collect and preserve the germplasm resources of C. camphora.

Plant Material and DNA Isolation
Leaf samples from 171 C. camphora trees were collected in this study from 39 natural populations in the whole range of distribution of C. camphora in China and one population in Japan. Sample sources were grouped into four regions based on geographical origin. There were three populations with a total of 14 samples from the western region in China, six populations with a total of 25 samples from the central region, 30 populations with 128 samples from the eastern region, and one population with four samples from Japan (Figure 1 and Table 1). Trees with a diameter at breast height greater than 60 cm were randomly selected, and the distance between them was greater than 50 m. Five healthy leaves were collected from each tree, dried with silica gel. Details on sampling were summarized in Table S1. DNA was extracted using the ionic detergent cetyltrimethylammonium bromide, and the purity and integrity of the DNA were detected by 1% agarose gel electrophoresis. The DNA quality and quantity were determined by a NanoDrop 2000 (Thermo Fisher Scientific Inc., Waltham, MA, USA) and a Qubit ® 2.0 Fluorometer (Invitrogen, Carlsbad, CA, USA), respectively.

GBS Library Construction and Population SNP Identification
Genomic DNA (0.1-11 µg) was digested with the ApeKI restriction endonuclease, and the adaptor and barcode were ligated at both ends by T4 DNA ligase. Then each DNA sample was amplified by PCR. After mixing the samples, the DNA fragments (350-400 bp) were recovered and purified by agarose gel electrophoresis, and the required fragments were selected for the construction of the GBS library. Finally, paired-end sequencing at 150 bp per read was performed on the Illumina HiSeq 2500 sequencing platform (Illumina, San Diego, CA, USA). Stacks 1.44 [18] was used to cluster the high-quality data of each individual based on sequence similarity. During clustering, low-depth stacks were filtered out, leaving high-depth stacks as the clustering results. The mismatched bases were corrected in each stack to eliminate the impact of sequencing errors and avoid false-positive SNP. Clean data of sample ZJCA-157 and JXAF-115 were selected to be combined and assembled as the reference genome under the parameter (ustacks -t gzfasta -i 1 -m 1 -M 6 -N 6 -p 6). BWA (Burrows-Wheeler-Aligner) was used to compare the high-quality and effective sequencing data against the reference genome (using the assembly results of samples ZJCA-157 and JXAF-115 as the reference genome) [19]. The SNP in the population were detected with SAMTOOLS software. The SNP in the population to be tested were detected using the Bayesian model. The obtained SNP were filtered with the SAMTOOLS [20] settings main parameter -minDP4-max miss0.1-maf0.01. The number of bases support for each SNP was not less than 1. Due to repeat regions in the genome, the number of bases support for each SNP should not be greater than 7000 to reduce the number of false-positive SNP caused by repeat region alignment errors. The genotype quality value was ≥5.

Statistical Analysis of Data
The distance matrix was calculated using TreeBest-1.9.2 software (http://treesoft. sourceforge.net/treebest.shtml, accessed on 28 October 2021). The phylogenetic tree was constructed through 1000 calculations using the neighbour-joining method [21]. The eigenvectors and eigenvalues were calculated using GCTA (http://cnsgenomics.com/software/ gcta/pca.html, accessed on 28 October 2021) software, and the principal component anal- ysis (PCA) distribution map was plotted using R software [22]. The population genetic structure and individual allocation information were constructed using frappe software [23]. The maximum iteration of the expectation-maximization algorithm was set to 10,000 in the Frappe analysis. We predefined the number of genetic clusters from K = 2 to K = 4. Based on the filtered SNP loci, genetic diversity indicators including the observed heterozygosity (Ho), expected heterozygosity (He), interpopulation genetic differentiation coefficient (F ST ), and interindividual inbreeding coefficient (F IS ) of each population were calculated by Arlequin software [24]. Analysis of molecular variance (AMOVA) was implemented in GenAlEx v6.5 [25]. Analysis of population nucleotide diversity (π) was performed with Stacks 1.44 [18].

GBS Analysis and SNP Identification
Total 138.24 Gb bases were generated by GBS sequencing from the 171 C. camphora trees (Table S2). After cleaning out low-quality sequences, 138.23 Gb of high-quality sequence data (Q20 ≥ 93.34%, Q30 ≥ 85.00%) remained, for an average of 0.81 Gb per sample. The average GC content was 38.24%. A total of 960 million clean reads were obtained, ranging from 2,878,540 (HBCB-40) to 13,528,004 (HNCZ-55), with an average of 5,613,668 per sample. An average of 5,389,736 reads (96.18%) was aligned to a reference genome, with the highest (97.18%) alignment rate for GZDZ-18 and the lowest (73.50%) rate for ZJCA-157. The reference genome coverage was in an average of 11.82% 1× genome (between 8.2% and 16.1%), and an average of 5.5% 4× genome (between 4.12% and 7.41%). A total of 130,450 SNP loci were obtained based on the comparison and detection by SAMTOOLS software. After filtering, 14,718 high-quality SNP loci were used for further analysis.

Population Genetic Structure
To reveal the population genetic structure of C. camphora, we divided the 40 populations into four groups, eastern China, central China, western China, and Japan. The neighbour-joining tree showed that, except for the HNCZ-54 trees from the east and HNHH-31 from the west regions, the other 169 C. camphora trees were clustered into three clades ( Figure 2). Cluster I was composed of 18 C. camphora trees, including 16 from the central region and two from the CQDZ population in the western region (11,12). Cluster II included 15 trees mainly from the west region, in which two trees from the central region  To detect the highest population structure level, we performed the analysis when the parameter K was between 2 and 4 ( Figure 3). The results showed that when K = 2, the trees from central China and the HNCZ-54, CQDZ-11, and CQDZ-12 from the western region were first separated from the others and formed subgroup I. When K = 3, trees from the  To detect the highest population structure level, we performed the analysis when the parameter K was between 2 and 4 ( Figure 3). The results showed that when K = 2, the trees from central China and the HNCZ-54, CQDZ-11, and CQDZ-12 from the western region were first separated from the others and formed subgroup I. When K = 3, trees from the west region were separated from subgroup II and formed subgroup III. When K = 4, subgroup IV was further divided into the populations from Japan and eastern China, and the Japanese C. camphora population was finally clustered into its own subgroup. These results are consistent with the above phylogenetic tree analysis.  PCA showed that the 171 C. camphora trees clustered three main groups ( Figure 4). All trees can be significantly separated into two groups: central China and eastern China + Japan, according to the first principal component (PC1). In comparison, the western China populations showed a certain level of differentiation from the others two groups on the second principal component (PC2). This result is consistent with the clustering tree ( Figure 2) and the population structure analysis (Figure 3). All these data indicated a clear geographical pattern of the genetic structure of the C. camphora populations.

Genetic Diversity in C. camphora
At the species level, the average values of Ho and He of C. camphora were 0.365 and 0.352, respectively ( Table 2). The analysis of genetic diversity in each region showed that the population from Japan had the highest level of genetic heterozygosity (Ho = 0.504, He = 0.     Analysis of molecular variance (AMOVA) of the C. camphora populations from the four regions (Table 3) showed that 7.02% of the genetic variation was between the regions. The genetic variation between the populations from the same region was only 0.95%. However, 92.03% of the genetic variation was among the individuals within each population. Among the four regions, the mean genetic differentiation coefficient F ST was 0.099, gene flow (Nm) was 2.718, indicating a moderate genetic differentiation between populations. The genetic differentiation among eastern, western, and central China was relatively low (F ST = 0.052-0.071, 0.064 on average), less than the differentiation between China and Japan (F ST = 0.096-0.152, 0.133 on average). The genetic differentiation between the eastern China population and the Japanese population was low (0.096). In contrast, a relatively high level of genetic differentiation between western or central China and Japan was observed, 0.152 and 0.151, respectively. Gene exchange between different regions in China was relatively frequent (Nm = 3.271-4.558), but the gene flow between the Chinese population and the Japanese population was limited (Nm = 1.395-2.354), indicating that geographical isolation has exacerbated the genetic differentiation between the C. camphora populations (Table 4).

Genotyping by Sequencing
Traditionally, molecular markers such as random amplified polymorphic DNA, amplified fragment length polymorphisms, and SSRs have been widely used to analyze plant genetic diversity. With the rapid development of NGS technology, a new method that is simple but powerful, GBS, has achieved high-efficiency and large-scale genome-wide SNP identification in some species [8]. In 2011, Elshire et al. used GBS to investigate maize (IBM) and barley (Oregon Wolfe Barley) populations. To establish a library with reduced genomic complexity, they identified approximately 200,000 and 25,000 sequence markers, respectively. The analysis showed the potential of this library for marker screening in large-genome species [10]. Pootakham et al. obtained 524,508,111 reads from oil palm (Elaeis guineensis Jacg.) using GBS technology, approximately 88% reads were aligned to the reference genome, and 21,471 SNP loci were identified, which was significantly lower than the alignment rate of reads on the reference genome in this study (96.18%) [26]. Gurcan et al. performed SNP detection on apricots by the GBS technique and generated approximately 28 Gb of high-quality data after filtering [27]. Wu et al. applied GBS technology to 75 samples of Lauraceae plants, including 19 samples of C. kanehirae, 31 of C. camphora, and 25 putative hybrids, for breed identification and genetic analysis. In total, 529,006 SNPs were obtained in the calling analysis, though after filtering, 840 high-quality SNPs were obtained for subsequent analysis. Among them, C. camphora groups in eastern and south-western Taiwan had 201 SNP loci, while the western and northern groups (including Vietnam and Japan) had 376 SNP loci [9]. In this study, by sequencing 171 C. camphora trees, we obtained 480 million high-quality reads and identified 130,450 SNPs. After filtering, 14,718 high-quality SNPs were used for subsequent genetic diversity and genetic structure analysis. The quality of the obtained reads and SNP loci was as high as the studies from Pootakham et al. and Gurcan et al.

Genetic Diversity of C. camphora
Genetic diversity refers to the degree of genetic polymorphism between different populations or individuals within a species. Rich genetic diversity is the basis for a species to respond to environmental changes and for people to carry out genetic improvement programs [28,29]. Based on 22 SSR markers, Kameyama evaluated the genetic polymorphism of 104 C. camphora samples from three Japanese populations: Meiji Jingu (Shinto Shrine), Kajiya Plantation, and Manazuru Peninsula. They found the average Ho, and He of each population was 0.53-0.60 and 0.55-0.68, respectively, suggesting a high level of genetic diversity [1]. Kameyama et al. further performed genetic diversity analysis on 817 C. camphora individuals from six populations, including three populations (Fujian, Shanghai, and Taiwan) from China and three populations (Eastern and Western Kyushu regions) from Japan. This study showed that a high level of genetic diversity in the C. camphora population, in which the Chinse populations (Ho = 0.728, He = 0.878) showed significantly greater diversity than that of the Japanese populations (Ho = 0.614, He = 0.714) [2]. In comparison to the previous study, our previous studies showed a moderate level of genetic diversity (Ho = 0.30-0.61, He = 0.31-0.48) of 41 Chinese C. camphora natural populations from the whole range of distribution of C. camphora in southern China using EST-SSR markers [4,5]. Interestingly, by using the GBS technique, Wu et al. observed a significantly lower genetic diversity in the population, including 75 individuals from Taiwan, 19 C. kanehirae, 31 C. camphora, and 25 putative hybrids. The population genetic diversity (Ho = 0.108, He = 0.123) of C. camphora in western and northern Taiwan was slightly higher than that in eastern and south-western Taiwan (Ho = 0.085, He = 0.076), and the Ho of C. camphora was greater than that of C. kanehirae (Ho = 0.076) [9]. In this study, we analyzed the genetic diversity of the Chinese and Japanese C. camphora populations through SNP markers generated by the GBS technique. We found that the genetic diversity of the Chinese C. camphora populations was relatively high (Ho = 0.319, He = 0.322), which is consistent with our previous findings based on SSR markers [4,5]. However, our populations showed higher genetic diversity than those found by Wu et al. [9]. One possible reason is that the geographical range of populations in Taiwan is small, and the cultivation history of C. camphora is long. During the process of the breeding program, a limited number of selected elite trees led to lower diversity [30]. Moreover, the asexual reproduction process using cuttings has further reduced the genetic diversity of C. camphora. In this study, a total of 11 Chinese C. camphora trees were clustered into different regions (Figure 2). And these trees were obtained from recent cultivations (Table S1). Kameyama et al. showed that human activity had significantly affected gene exchange between C. camphora populations [2]. The genetic diversity of a tree species is typically determined by its geographical distribution range and population size [31,32]. Although C. camphora is widely distributed in southern China, its origin is limited to East Asia and the Indochina Peninsula. The limited geographical distribution inhibits the gene exchange among populations and enhances genetic diversity within the population [33]. In addition, this study, together with the findings from Wu et al., showed lower genetic diversity of C. camphora in China than that of the Japanese population, which was not consistent with the results of Kameyama et al. [2]. This inconsistency may be due to different sampling strategies, such as inconsistencies in the source and size of the population used [32,34]. The difference between the GBS-SNPs and SSRs used in these studies could be another reason for the different findings.

Genetic Differentiation and Population Genetic Structure of C. camphora
Studying the genetic structure of plant populations is critical to the understanding and maintenance of plant genetic diversity. The genetic structure of populations is affected by the breeding system, genetic drift, population size, seed dispersal, gene flow, evolutionary history, and natural selection [2,35]. As an essential indicator of the population's differentiation, the F ST ranging from 0.05 to 0.15 indicates moderate genetic differentiation between populations [36,37]. In this study, the F ST of C. camphor between the four areas was 0.052-0.152 (Table 3), indicating a moderate genetic differentiation of C. camphor groups. AMOVA showed 92.03% of the genetic variation was within the population, leaving the genetic variation between regions and populations at only 7.97%. These results are consistent with previous studies on C. camphora [2,4] and other tree species in southern China [38,39]. The Nm value also suggested frequent gene exchange between the populations in the region, indicating that the gene exchange offsets the effect of genetic drift to some extent [40]. Considering that C. camphora is an insect-pollinated plant, the impact of insects on the long-distance transmission of its pollen may be relatively small. The seeds of C. camphora are taken by various birds in the south, so the gene flow between populations is likely to be carried out by long-distance transport of seeds. Hence, the low level of genetic differentiation between populations can be explained by the long-distance transmission and transport of pollen or seeds [41]. In addition, considering the riparian habitat of C. camphora, its seed dispersal is easily affected by the regional water system. On the other hand, the insect-pollinated plants with dominant outcrossing provide a new isolation method for the reproductive system, reduce pollen flow between populations, and thus increase genetic differentiation and erosion among populations [4].
The phylogenetic tree showed our 171 C. camphora trees from 40 populations clustered into three major groups, indicating an apparent geographical distribution pattern. Southwest China is considered one of the hot spots of species diversity, with rich species diversity, and is also the origin and differentiation centre of many plants and glacial refuges [2,42]. The western regional populations of the Yunnan-Guizhou Plateau constituted subgroup I, with the highest genetic diversity, including YB and LZ in Sichuan and DZ in Chongqing. 137 C. camphora trees from the Japanese population and eastern China (excluding only HNCZ-761) constituted the largest subgroup ( Figure 2, Cluster III), with the lowest genetic diversity. These regional populations in China are in Jiangxi, Anhui, and Hunan Provinces. Their border areas with neighbouring provinces have the most extensive distribution and the most extant ancient C. camphora populations in China. The C. camphora population in central China was clustered into a single subgroup II, with two exceptionals of the Hubei YC and Hunan SF populations into subgroup III, indicating a certain degree of gene exchange within eastern China. Similar results were obtained from PCA and population structure analysis. Mountains play a crucial role in the extinction and preservation of subtropical Chinese species and the generation of new species and new population structures. Due to the geographical isolation caused by mountains, gene exchange across mountains is hindered, resulting in significant differentiation of genetic structure and increased genetic diversity [43][44][45][46]. It was speculated that the Japanese C. camphora originated from China [47]. Moreover, according to our study, the C. camphora species formation was relatively late, approximately in the late Tertiary period (unpublished data). Multiple transgressions and regressions in the East China Sea in the fourth century affected the segregation and connection of eastern China, southern Japan, and the Korean Peninsula, which significantly impacted the genetic structure of plants [48][49][50][51]. This genetic exchange led to the close genetic relationship between the C. camphora populations in eastern China and Japan.

Conservation Strategies for C. camphora
In situ conservation is an effective measure to protect wild germplasm resources [52]. The native habitat of C. camphora is widely distributed throughout East Asia and China in the region with the most abundant wild resources. In recent decades in southern China, many wild C. camphora has been cut down to extract essential oil by destructive methods, such as digging trees and grubbing roots. Meanwhile, many ancient C. camphora (defined as diameter at breast height exceeding 100 cm) were transplanted into the city during the urban vegetation process. The ageing, physiological function decline, and habitat deterioration have also aggravated the weakening and death of some ancient C. camphora trees. Altogether, these factors caused the shrinking of existing wild C. camphora resources. Gene flow is a crucial factor affecting the genetic diversity and genetic differentiation of species [53]. The breeders' exchange of seedlings and seeds has significantly affected gene flow between C. camphora populations in China and Japan [2]. Transplantation of large and ancient trees should be avoided to maintain the genetic structure of C. camphora in each region. At the same time, as an insect-pollinated plant, factors such as genetic erosion and genetic drift may also affect the level of genetic diversity of the offspring of C. camphora. There are frequent gene introgressions between C. camphora and other species of the same genus [9], so ex-situ conservation is another important way to protect wild species. We should include as many individuals with different haplotypes and evolutionarily significant units as possible for ex-situ introduction. Attention should also be given to the mutual introduction of various populations to increase their gene exchange and diversity, to resist the negative effects of genetic drift, thereby reducing the risk of genetic diversity reduction in wild populations.

Conclusions
This is the first study to analyze the genetic diversity and genetic structure of C. camphora in East Asia, including 39 populations in its whole distribution range in China and one Japanese population, based on the GBS technique. The results showed that the genetic diversity of the C. camphora populations in China were significantly lower than that of the Japanese population. The AMOVA, pairwise F ST, and Nm analysis indicated that the genetic differentiation between populations was quite low, as the genetic variation was mainly within each population. Phylogenetic tree, PCA, and population structure analysis showed that the 40 populations were clustered into three major groups, i.e., western China, central China, and eastern China + Japan, showing a clear geographical distribution pattern. The phylogenetic analysis showed the closest relationship between the populations in eastern China and Japan. There is significant gene exchange between adjacent regions, and due to geographical isolation, the gene exchange between Chinese and Japanese populations is relatively weak. Because the Japanese population sample size in this study was small, further studies will be needed with increased sample size from the Japanese population, and populations from Taiwan and Vietnam to further confirm the findings of existing studies. In any case, this study's results will help further understand the current status of genetic diversity and genetic structure of C. camphora in East Asia and provide a theoretical basis for forming appropriate conservation measures and genetic improvement strategies.