Comparative Analysis of Complete Chloroplast Genomes of Nine Species of Litsea (Lauraceae): Hypervariable Regions, Positive Selection, and Phylogenetic Relationships

Litsea is a group of evergreen trees or shrubs in the laurel family, Lauraceae. Species of the genus are widely used for a wide range of medicinal and industrial aspects. At present, most studies related to the gene resources of Litsea are restricted to morphological analyses or features of individual genomes, and currently available studies of select molecular markers are insufficient. In this study, we assembled and annotated the complete chloroplast genomes of nine species in Litsea, carried out a series of comparative analyses, and reconstructed phylogenetic relationships within the genus. The genome length ranged from 152,051 to 152,747 bp and a total of 128 genes were identified. High consistency patterns of codon bias, repeats, divergent analysis, single nucleotide polymorphisms (SNP) and insertions and deletions (InDels) were discovered across the genus. Variations in gene length and the presence of the pseudogene ycf1Ψ, resulting from IR contraction and expansion, are reported. The hyper-variable gene rpl16 was identified for its exceptionally high Ka/Ks and Pi values, implying that those frequent mutations occurred as a result of positive selection. Phylogenetic relationships were recovered for the genus based on analyses of full chloroplast genomes and protein-coding genes. Overall, both genome sequences and potential molecular markers provided in this study enrich the available genomic resources for species of Litsea. Valuable genomic resources and divergent analysis are also provided for further research of the evolutionary patterns, molecular markers, and deeper phylogenetic relationships of Litsea.


Introduction
Litsea is an evergreen tree or shrub and is one of the most diverse genera (about 400 species) in the family Lauraceae (Mesangiospermae: Magnoliids: Laurales). It is widely distributed in tropical and subtropical Asia, North and South America [1,2], and 74 species are located in China, at a maximum elevation of 2700 m above sea level [3] Species of Litsea are utilized in a wide range of applications, covering medical, agricultural, industrial, and many other fields. Litsea can be used to treat a variety of conditions such as diarrhea, stomach pain, indigestion, the common cold, gastroenteritis, diabetes, edema, arthritis, asthma, pain, and trauma [1]. In addition, Litsea is also known for the highly effective properties of its essential oil against food-borne pathogens [4]. Its essential oils can also be resistant to several types of bacteria, have antioxidant, anti-parasitic, acute toxicity, genotoxic, and cytotoxic properties, and can even prevent several types of cancer [5][6][7]. Despite the pharmaceutical applications of Litsea, it is also widely used as feed for silkworm pupae, especially for muga silkworms (Antheraea assama) [5]. In comparison with ordinary

Sample Collection, DNA Extraction, and Sequencing
In this study, samples of nine species of Litsea were collected from the Plant Germplasm and Genomics Center, Kunming Institute of Botany, the Chinese Academy of Sciences. The process of sample collection was approved by the Kunming Institute of Botany and local policy and deposited in the Evolutionary Biology Laboratory of Qingdao University of Science and Technology. Fresh leaf tissues were collected without apparent disease symptoms and preserved in silica gel. Total genomic DNA was extracted from 150 mg of silica-dried leaf tissues using modified CTAB [20]. The quantity and quality of the extracted DNA was assessed by spectrophotometry while the integrity was evaluated using a 1% (w/v) agarose gel electrophoresis [19] The Illumina TruSeq Library Preparation Kit (Illumina, San Diego, CA, USA) was used to prepare approximately 500 bp of paired-end libraries for DNA inserts, according to the manufacturer's protocol. These libraries were sequenced on the Illumina HiSeq 4000 platform in Novogene (Beijing, China).

Chloroplast Genome De Novo Assembly and Annotation
The raw data were preprocessed using Trimmomatic 0.39 software [21], including the removal of adapter sequences and other sequences introduced during sequencing, the removal of low-quality and over-N-base reads, etc. The quality of newly produced clean short reads was assessed using FASTQC v0.11.9 [22] and MULTIQC software [23]. High-quality data with Phred scores averaging above 35 were screened out. According to the reference sequence (Litsea glutinosa, KU382356), the chloroplast-like reads were isolated from clean reads by BLAST [24]. Short reads were de novo assembled into long contigs with SOAPdenovo 2.04 [25] by setting kmer values of 35, 44, 71, and 101. Finally, the longcontigs complete sequence expansion and gap filling was done by Geneious ver 8.1 [26], which forms the complete cp genome. The complete cp genome was further validated and calibrated using de novo splicing software NOVOplsty 4.2 [27]. GeSeq [28] was used to annotate the assembled genomes, and tRNAscanSE ver 1.21 [29] was applied to detect tRNA genes with default settings. RNAmmer [30] was used to validate rRNA genes with default settings. As a final check, we compared the results with the reference sequence and corrected misannotated genes by GB2Sequin [31] by manual selection. The circular map of the genomes was drawn using CHLOROPLOT [32]. The nine newly assembled Litsea cp genomes were deposited in GenBank with the accession numbers NC_056809-NC_056817.

Analysis of Chloroplast Genome Characteristics
Information regarding the GC content, genome length, and number of each region in cp genomes was obtained using Geneious ver 8.1 software [26]. Relative synonymous codon usage (RSCU) was calculated by the Computer Codon Usage Bias function in MEGA X [33]. SSRs were identified using MISA [34], with a setting of ten repeats for mononucleotide SSRs, four for dinucleotide and trinucleotide SSRs, and three for tetranucleotide, pentanucleotide, and hexanucleotide SSRs. REPuter [35] was used to identify four types of repeats with the minimum repetition unit set as 20 bp and the maximum as 300, and the remaining options set to default parameters.

Comparative Analysis
To compare the gene differences among the nine species, Litsea garrettii (NC_050349) was selected as the reference species and the online comparison tool mVISTA [36] was used for sequence alignment. IRscope [37] was used to detect and visualize the contraction and expansion of IRs boundaries. SNP and InDels were detected using Geneious ver 8.1. The Ka/Ks value was batch evaluated by TBtools with the NG method [38]. DnaSP 6 was used to analysis the nucleotide diversity (Pi) value [39].

Phylogenetic Analysis
We downloaded 12 further cp genomes of Litsea from NCBI (National Center for Biotechnology Information). Two species from Lauraceae but in different genera, Actinodaphne obovate and Neolitsea sericea, were selected as outgroups to root our phylogenetic networks. A total of 23 species were compared for phylogenetic evaluation using maximum likelihood (ML) and Bayesian inference (BI) approaches. MAFFT v7 was used to perform multiple genome alignment [40], and we used the complete cp genome sequence data as well as a separate dataset of 64 protein-coding genes shared by all species to construct individual maximum likelihood (ML) topologies. The GTR+G+I model was evaluated as the best suit model for both CDS and all cp sequences by applying the Bayesian information criterion (BIC) using jmodeltest v2.1.7 [41]. The ML analyses were performed using MEGA X [33], and bootstrap tests were performed with 1000 replicates with tree bisection-reconnection branch swapping. MrBayes v3.2.7 [42] was used to the BI analysis, for two million generations, sampled every 100 generations, with all other settings left at their defaults and 25% of the trees discarded as burn-in.

Chloroplast Genome Features of Litsea
With reliable quality control, we filtered about 22.5 GB of high quality, 2 × 150 bp pair-end reads generated by the Illumina HiSeq 4000 platform. The mean coverage of sequencing was 1750 X. The cp genome features of nine species were analyzed and the total length ranged from 152,051 to 152,747 bp ( Figure 1). 128 genes were found in these complete cp genomes, including 36 tRNA genes, eight rRNA genes, and 84 protein-coding genes. These genes can be divided into three categories: self-replication related, photosynthesis related, and other genes. The large subunit of ribosomal proteins, small subunit of ribosomal proteins, DNA-dependent RNA polymerase, rRNA genes, and tRNA genes belong to the Self-replication category. Photosystem I, Photosystem II, NADH oxidoreductase, Cytochrome b6/f complex, ATP synthase, and Rubisco belong to the Photosynthesis category. The remaining genes that have not been authorially classified yet were attributed to the other genes category (Table 1) [43].
Genes 2022, 13, x FOR PEER REVIEW Figure 1. Complete genome map of the chloroplast genome for representative species L The inner gray ring is divided into four areas, clockwise, and they are SSC, IRb, LSC, an genes in the outer ring region are transcribed clockwise, while those in the inner ring are counterclockwise. In addition, this figure also reflects the GC content; the inner ring dar cates the GC content, the light gray reaction AT content. In the lower left is a legend th cp genes according to their functions.

Group of Genes
Gene Names Pholosystem I psaA, psaB, psaC, psal, psaJ The inner gray ring is divided into four areas, clockwise, and they are SSC, IRb, LSC, and IRa. The genes in the outer ring region are transcribed clockwise, while those in the inner ring are transcribed counterclockwise. In addition, this figure also reflects the GC content; the inner ring dark gray indicates the GC content, the light gray reaction AT content. In the lower left is a legend that classifies cp genes according to their functions.  Other genes accD, clpP, ccsA, cemA, infA, rpoA, matK 7 * Gene with two copies.
Typical quadripartite and circular structures were discovered. These cp genomes contain a large single-copy (LSC) of 93,093-93,631 bp and a small single-copy region (SSC) of 18,813-18,902 bp, separated by two identical interspersed regions (IRs) of 20,014-20,117 bp. Among the four types of regions, the LSC region contained the largest number of genes, including 66 protein-coding genes and 23 tRNA genes. The SSC region contained only 11 protein-coding genes and one tRNA gene, but its average gene length was the longest at 1100 bp. Two identical IR regions contained five protein-coding genes, six tRNA genes, along with four rRNA genes ( Table 1). The genome features of Litsea are consistent with the basic structure of cps reported by other studies [44].
We also analyzed the GC content of the complete cp genome for the nine species of Litsea, as well as the values of each region (Table 2). We discovered that the average GC content of the full cp genome was 39.2% for all species except for L. sericea, which was 39.1%. In addition, the GC content of the IR region was firmly consistent at 44.4% and significantly higher than the other two regions, which was assumed to be related to the presence of many rRNA genes [45]. 152

Codon Usage Analysis
All organisms share the same common codon table, reflecting the shared ancestry of all life. But through the process of biological evolution, disproportionate biases have evolved. Different species exhibit certain preferences for not only different synonymous codons, but also different proteins within the same species may show a preference for the same amino acid, a phenomenon called codon bias. A measurement called RSCU is commonly used to reflect the codon bias, which removes the effect of the amino acid composition of a codon [44]. Since L. moupinensis had the largest cp genome, we used it as an example to calculate the codon usage bias and RSCU values of 84 CDS genes. The protein-coding genes in the complete cp genome of Litsea consist of 84 genes coded by 61 codons, which encode 20 amino acids. The results showed that Leu (UUA), Ala (GCU), and Arg (AGA) are the most frequently used amino acids, while Ser (AGC) and Arg (CGC) are the least abundant amino acids ( Figure 2). RSCU values greater than one mean that there is significant codon bias. This results in a different use of amino acids, which correlates with protein-positive bias [45]. Analysis of RSCU values of the codons encoding each amino acid revealed that most codons with RSCU > 1 contained either an A-or G-terminal. By contrast, RSCU values for codons that ended with a C-terminal, such as CGC (Arg), UGC (Cys), CAC (His), and AGC (Ser), were relatively low. This result was consistent with previous related reports [46].

Codon Usage Analysis
All organisms share the same common codon table, reflecting the shared ancestry of all life. But through the process of biological evolution, disproportionate biases have evolved. Different species exhibit certain preferences for not only different synonymous codons, but also different proteins within the same species may show a preference for the same amino acid, a phenomenon called codon bias. A measurement called RSCU is commonly used to reflect the codon bias, which removes the effect of the amino acid composition of a codon [44]. Since L. moupinensis had the largest cp genome, we used it as an example to calculate the codon usage bias and RSCU values of 84 CDS genes. The proteincoding genes in the complete cp genome of Litsea consist of 84 genes coded by 61 codons, which encode 20 amino acids. The results showed that Leu (UUA), Ala (GCU), and Arg (AGA) are the most frequently used amino acids, while Ser (AGC) and Arg (CGC) are the least abundant amino acids ( Figure 2). RSCU values greater than one mean that there is significant codon bias. This results in a different use of amino acids, which correlates with protein-positive bias [45]. Analysis of RSCU values of the codons encoding each amino acid revealed that most codons with RSCU > 1 contained either an A-or G-terminal. By contrast, RSCU values for codons that ended with a C-terminal, such as CGC (Arg), UGC (Cys), CAC (His), and AGC (Ser), were relatively low. This result was consistent with previous related reports [46].

Long Repeat and SSR Analysis
Long repeat sequences and SSRs analysis commonly exist throughout the cp genome, consisting of one to six nucleotide repeats [44]. Due to its variability at the intraspecific level, SSRs are commonly used as markers in population genetic analyses [47,48]. In the cp genome of nine species, the total number of repeats ranged from 109 (L. chuni) to 119 (L. auriculata) (Table S1). A total of 111 SSRs were detected from the cp genome of the representative species L. moupinensis, including 62 mononucleotide, 36 dinucleotide, tree trinucleotide, eight tetranucleotide, one pentanucleotide, and one hexanucleotide repeats. In general, the SSR number decreased along with the increase in nucleotide number. The percentage of tri-, tetra-, penta-, and hexa-nucleotide repeat sequences detected were remarkably lower than that of mono-and di-nucleotide repeat. Mono-nucleotide repeats were the largest class of SSRs that consisting of 56.97% of all repeats. These repeats were notably rich in A/T bases, causing the differences in terms of base content, which was in line with other angiosperm species [49]. We also analyzed the distribution of SSRs in LSC/SSC/IR regions. The number of SSR markers in the LSC region of nine species of Litsea ranged from 79 to 87, far exceeding that of SSC (19) and IR regions (12). In particular, IR region contains the lowest number of SSRs, which further demonstrates the high degree of conservatism of IR regions. This correlated phenomenon was previously reported in other angiosperms studies [50].
Some repeats larger than 30 bp in length are called long repeat sequences, which increase the rearrangement of the cp genome [51]. We investigated the interspersed repeated sequences (IRs) including four types of long repeat sequences: complement repeats (C), forward repeats (F), palindromic repeats (P), reverse repeats (R). In general, palindromic repeatswere richest in most species, followed by forward repeats and reverse repeats. Complement repeats (C) were notably rare among all species. However, in the cp genome of L. ichangensis, the number of forward repeats were slightly higher than that of palindromic repeats (16). What more, in the cp genome of L. auriculata, the ratio of reverse repeats (R) was more than that of forward repeats (F), which is also different from the other eight species ( Figure 3A). Most repeats were found in LSC region, leaving SSC and R regions far behind. This pattern is highly consistent in nine Litsea species analyzed ( Figure 3B). We also measured the number of long repeat sequences with different lengths ( Figure 3C). It was found that long repetitive sequences of length of 20 and 21 bp were most common, while the remainder decreased in number with an increase in length in general. The repeats with 29, 31, and 38 bp in length were almost absent. However, in 33 bp and 44 bp, the repeats number presented to be tied for third place suddenly. This phenomenon varied from different species, which may be affected by unknown molecular mechanisms [52,53].

IR Contraction Analysis and Sequence Identity Plot
The contraction and expansion of IR regions contribute greatly to variations of cp genomes among different species, resulting in gene duplication, deletion, and the generation of pseudogenes. Studying the characteristic genes of the border region contributes to species identification and phylogenetic analyses [54]. In this study, we analyzed and visualized the genes located in the junction region of LSC and IRa (JSa) as well as the junction of SSC and IRb (JSb) in the cp genome of the nine species of Litsea (Figure 4). JLa represents the junction between LSC and IRa, and the same applies for JLb. In this study, we observed that genes located in the junction of four regions were highly conserved, with only a few variations. Most genes located at cp genome junctions in all nine species differed only in the distance to their corresponding boundaries, such as ycf2, ndhF, trnH, and psbA. To be more specific, the ycf2 gene spans LSC/IRb and is distributed in both regions of similar length, with the LSC region being slightly longer. The ndhF gene exists among nine species, completely located in SSC and a short distance from IRb except for L. sericea, of which theirs was longer and closer to the JSb boundary. The trnH gene is located in the LSC region, adjacent to the IRa/LSC border, and is 21-22 bp in length. PsbA is located entirely in the LSC region. Yet, notable variations were found. The ycf1 gene was absent in this junction, while the remaining eight species contain ycf1 Ψ (pseudocopy, 5 end missing) in JSb, which spans JSb with only 4-5 bp of length, located in SSC. Apart from that, the contraction and expansion event located in the JSb was greater than that of the JLa boundary. This pattern is consistent with previous IR region research [55].
The whole sequence identity plot of nine species within Litsea was analyzed using mVISTA with L. garretti (NC_050349) set as the reference sequence for comparison ( Figure 5). Genome sequences of the nine Litsea exhibited a high degree of concordance. In this study, we revealed that most of the variations in the cp genome of different species were distributed in CNS (non-coding sequences) regions. Notable high-divergent regions in CNS were atpF-atpH and ndhC-trnV-UAC, the divergent value of which exceeded 100%. Other variant regions include: rps16-trnQ-UUG, ycf4-cemA, rps8-rpl14, and rps12-trnV-GAC. Some of the coding genes, such as ndhK, ndhF, and ycf1, were found to were contain variable regions. In general, the divergence in the IR region was significantly smaller than that in the LSC and SSC regions, a result comparable to the previous divergence analysis [50].

IR Contraction Analysis and Sequence Identity Plot
The contraction and expansion of IR regions contribute greatly to variations of cp genomes among different species, resulting in gene duplication, deletion, and the generation of pseudogenes. Studying the characteristic genes of the border region contributes to species identification and phylogenetic analyses [54]. In this study, we analyzed and visualized the genes located in the junction region of LSC and IRa (JSa) as well as the junction of SSC and IRb (JSb) in the cp genome of the nine species of Litsea (Figure 4). JLa represents the junction between LSC and IRa, and the same applies for JLb. In this study, we observed that genes located in the junction of four regions were highly conserved, with only a few variations. Most genes located at cp genome junctions in all nine species differed only in the distance to their corresponding boundaries, such as ycf2, ndhF, trnH, and psbA. To be more specific, the ycf2 gene spans LSC/IRb and is distributed in both regions of similar length, with the LSC region being slightly longer. The ndhF gene exists among

SNP and InDels
To further explore the divergence of nucleotides, we compared and analyzed single nucleotide polymorphisms (SNPs) and insertions and deletions (InDels) using L. garrettii (NC_050349) as a reference sequence. The polymorphism ratio of transition substitution (Ts) was higher than transversion substitutions (Tv) in the LSC region of nine cp genomes ( Table 3). The most substitutions were located in the LSC region, while IR regions contained the lowest rate of polymorphisms. This result is consistent with previous studies [56]. In terms of transition substitutions, the polymorphism ratios of A/G and C/T were almost the same, although the former took up a slightly larger proportion, with only three exceptions (L. auriculata, L. chunii, and L. tsinlingensis). As for transversion substitutions, the polymorphism ratios of A/T and C/G were greatly lower than that of A/C and G/C substitutions. The same pattern applied for InDels (Table 4). LSC presented the largest number of InDels in comparison with IR and SSC regions, while the average length of InDels in IR regions was the longest, with the longest variation length being 678 bp (L. tsinlingensis).
of which theirs was longer and closer to the JSb boundary. The trnH gene is located in the LSC region, adjacent to the IRa/LSC border, and is 21-22 bp in length. PsbA is located entirely in the LSC region. Yet, notable variations were found. The ycf1 gene was absent in this junction, while the remaining eight species contain ycf1 Ψ (pseudocopy, 5′ end missing) in JSb, which spans JSb with only 4-5 bp of length, located in SSC. Apart from that, the contraction and expansion event located in the JSb was greater than that of the JLa boundary. This pattern is consistent with previous IR region research [55]. The whole sequence identity plot of nine species within Litsea was analyzed using mVISTA with L. garretti (NC_050349) set as the reference sequence for comparison ( Figure  5). Genome sequences of the nine Litsea exhibited a high degree of concordance. In this study, we revealed that most of the variations in the cp genome of different species were distributed in CNS (non-coding sequences) regions. Notable high-divergent regions in CNS were atpF-atpH and ndhC-trnV-UAC, the divergent value of which exceeded 100%. Other variant regions include: rps16-trnQ-UUG, ycf4-cemA, rps8-rpl14, and rps12-trnV-GAC. Some of the coding genes, such as ndhK, ndhF, and ycf1, were found to were contain variable regions. In general, the divergence in the IR region was significantly smaller than that in the LSC and SSC regions, a result comparable to the previous divergence analysis [50]. In particular, we found that in the cp genome of L. auriculata, the average length of InDels located in the IR regions contained a considerable number of small InDels rather than only several long InDels, as was found in the other eight species, causing its average length to be three times shorter than others. This result indicated that L. auriculata may have experienced some degree of mutation during its evolution that differed from its related species (Vaccinium) [57].

Nucleotide Divergence and Selection Pressure
Despite general consistency, variations occurred frequently during the evolutionary process, forming different genotypes and phenotypes. These nucleotide variations (Pi) could be distinguished as high divergent regions [58]. Some may accumulate through generations to better adapt to the environmental changes, which is called positive selection. In bioinformatics, the Ka/Ks value is commonly used to evaluate selection pressure. Here, we calculated and analyzed the Pi value of 79 unique protein-coding genes, 101 IGS (intergenic spacer) sequences, and the Ka/Ks value of 79 unique protein-coding genes ( Table S2). Most of the protein-coding genes possessed relatively low diversity, while the rpl16 gene presented with an extremely high Pi value (0.00892) among all samples ( Figure 6A). However, in IGS regions, the Pi values of 64 genes out of 101 exceeded 0.01 ( Figure 6B). Moreover, 54 among them surpassed 0.1. As for selection pressure, after filtering genes with no value, the Ka/Ks value of 23 of 25 genes were less than one using L. Garrettii (NC_050349) as a reference sequence. In other words, these genes were under negative selection pressure. Only two genes, rpl16 and ycf2, presented with a Ka/Ks value of greater than one, undergoing positive selection. No gene presented with a suggested neutral selection ( Figure 6C). Genes 2022, 13, x FOR PEER REVIEW 11 of 20     The hyper-variable regions detected in this study may provide a potential molecul marker for further studies. In particular, the rpl16 gene possesses both a high Pi value an Ka/Ks value at the same time. This might suggest that the rpl16 gene went through a gre mutation that was crucial to the evolution process of Litsea species. Although studies ha reported rpl16 to be one of the highly divergent genes [59] and a positive selection s [60], as unique and significant as the present study is this is not a common in studies other angiosperms. The other positive selection site, the ycf2 gene, was more common described in previous studies [61,62].

Phylogenetic Analysis
The expanding cp genome database provides an important basis for determinin evolutionary relationships [56,[63][64][65] Phylogenetic trees based on different data h slightly varied topologies, with trees based on the whole cp genome and CDS data havin The hyper-variable regions detected in this study may provide a potential molecular marker for further studies. In particular, the rpl16 gene possesses both a high Pi value and Ka/Ks value at the same time. This might suggest that the rpl16 gene went through a great mutation that was crucial to the evolution process of Litsea species. Although studies have reported rpl16 to be one of the highly divergent genes [59] and a positive selection site [60], as unique and significant as the present study is this is not a common in studies of other angiosperms. The other positive selection site, the ycf2 gene, was more commonly described in previous studies [61,62].

Phylogenetic Analysis
The expanding cp genome database provides an important basis for determining evolutionary relationships [56,[63][64][65] Phylogenetic trees based on different data had slightly varied topologies, with trees based on the whole cp genome and CDS data having the same topology, and being more credible than trees based on the IR area and introns [61,[66][67][68][69]. We found two similar topological structures with few changes based on the full cp genome and the protein-coding sequences of 23 selected species, with N. sericea and A. obovate as outgroup species (Table S3, Figure 7). characters [73]. These studies focused on the analysis of the relationships between different genera in Lauraceae. In comparative terms, the phylogenetic relationships constructed from the complete chloroplast genome are more accurate than those constructed from a few fragments [74]. Zhang et al. (2021) [75] suggested that Litsea could be divided into four sub-clades through the chloroplast genome. However, our study has suggested that Litsea is more appropriately divided into two sub-clades (Figure 7). We discovered that both the ML tree and the BI tree had greater support values for the phylogeny reconstructed from complete cp genomes. Such different trees could originate from substitutions in the intergenic spacer regions, which illustrates the importance of non-coding regions in phylogenetic analyses [76]. Therefore, complete cp genomes can be used as a 'super barcode' [77], and they have been demonstrated to be effective for preventing some identification errors and the discovery of new species [78]. Despite minor differences, the phylogenetic relationships of most species in the topologies were consistent, showing similar genetic affinities in the topology, and which aligned nicely with the elevational distribution of the species [71,79].

Conclusions
In this study, we sequenced and reported the complete cp genome sequences of nine species in Litsea, revealing typical quadripartite circular structures. We observed the contraction and expansion of IR boundaries. This event caused gene loss, changes in gene length, and the occurrence of pseudogenes, resulting in the differences between species. In terms of alignment consistency, the LSC region had the largest number of nucleotide variants, and IR regions showed a high degree of conservation. We found that the rpl16 and ycf2 genes underwent great positive selection pressure. Moreover, rpl16 gene also was found to be the only hyper-variable protein-coding gene in the gene divergent analysis, which was evaluated by the Pi value. This phenomenon is rare and further studies to unfold the molecular mechanism behind is needed.
Phylogenetic relationships within the genus were explored using two sets of data from the complete cp genome and another from 64 sets of protein-coding genes shared by 21 Litsea and two outgroup species. Essentially the same conclusions were obtained: L. moupinensis and Litsea rubescena, L. chunii, and L. tsinlingensis were sisters in the phylogenies and showed similar genetic relationships consistent with their elevational distributions. This study provides aid to taxonomic studies for Litsea, providing specific genetic markers for taxon identification and for inferring evolutionary relationships among the species. These data may also contribute to future conservation efforts as well as the practical use of these species.
Supplementary Materials: The following supporting information can be downloaded at: https: //www.mdpi.com/article/10.3390/genes13091550/s1, Table S1: Raw data of single sequence repeats of nine Litsea species, grouped by mono-, di-, tri-, tetra, penta-, and hexanucleotide repeats, along with the percentage of each group. Table S2: Raw data of Ka/Ks value and pi value of 79 protein-coding genes. Raw data of pi value of 101 IGS sequences. Table S3: Details of taxonomy and accession numbers of species mentioned in this study. File S1: The phylogenetic trees resulting from Bayesian inference analyis.

Data Availability Statement:
The data that support the findings of this study are openly available in the Genbank database at https://www.ncbi.nlm.nih.gov/, under accession number [NC_056809 -NC_056817] (accessed on 1 September 2021).