Complete Chloroplast Genomes of Three Salix Species: Genome Structures and Phylogenetic Analysis

High genetic diversity and low differentiation present challenges in taxonomy and systematics of Salix. Chloroplast (cp) genome sequencing is efficient for providing new genomic information and elucidating phylogenetic relationships. Salix spathulifolia Seemen, S. cupularis Rehder, and S. annulifera C.Marquand & Airy Shaw are three shrubby willows spread in high-altitude regions in western China. In this study, the integrated circular cp genomes were sequenced and analyzed, and a phylogeny of Salix was constructed on the basis of the cp genomes. The results of chloroplast assembly and annotation information were used to characterize genome feature and interspecific variation. The phylogenetic position of the three willows was evaluated using phylogenetic analysis. Full-length cp genomes were 155,566–155,680 bp with a typical double-stranded circular quadripartite structure, containing one large single-copy region (LSC, 84,431–4552 bp), one small single-copy region (SSC: 16,206–16,221 bp), and two inverted repeats (IR: 27,453–27,461 bp). The cp genomes encoded 130 genes, including 8 rRNA genes, 37 tRNA genes, and 85 protein-coding genes. The guanine-cytosine (GC) content of the overall genome was 36.7%. Comparison among the three willows’ cp genomes revealed high similarity. Phylogenetic analysis indicated that S. spathulifolia was a basal taxon of clade I, while S. annulifera formed a monophyletic group with S. rorida Laksch.; S. cupularis was sister to S. suchowensis W.C. Cheng and S. psammophila Z. Wang & Chang Y. Yang. The complete chloroplast genomes of the three willows provides an additional sequence-based resource for studying the phylogeny and evolutionary history of Salicaceae.


Introduction
The chloroplast (cp) is an essential semi-autonomous organelle within the cells of green plants, responsible for carbon fixation, energy conversion, and other important biochemical pathways [1]. Moreover, it possesses its own independent circular genome [2]. The cp genomes of most higher plants are characterized by a quadripartite structure typically consisting of two inverted repeats (IRs) with a length of 20-28 kb, separated by a large single-copy (LSC) region with a length of 80-90 kb and a small single-copy (SSC) region with a length of 16-27 kb. Almost all cp genomes range in size from 120 to 160 kb [3] with 110-130 uniquely encoded genes [4], exhibiting highly conserved gene content and order, with most genes involved in major functions including photosynthesis, transcription, and translation [5,6]. Information provided by structural characteristics of chloroplast genomes can be helpful for inferring phylogenetic relationships among broad sets of plant taxa and enable studies on evolutionary forces that shape the plastome size and structure, such as gains and losses of genes and introns, expansion and contraction of the IRs, and inversion. Furthermore, analyzing the structure information of the chloroplast genome is of great significance for the study and utilization of chloroplast photosynthesis [7]. Furthermore, HiSeq4000 platform with an insert size of around 350 bp at Shanghai Majorbio Bio-pharm Technology Co., Ltd, Shanghai, China. The raw data after sequencing the chloroplast genomes of three willows were taken from Gulyaev et al. in preparing [26]. In order to ensure the quality of subsequent analysis, we used Fastp with default settings to remove low-quality sequences and adapters from the raw data [27]. We deposited the samples of the complete chloroplast genome sequences to the NCBI GenBank (accessions MZ365445, MZ365446, MZ365447).

Chloroplast Genome Assembly and Annotation
We used Geneious 10.2 to conduct the entire chloroplast genome assembly process [28], mainly following He et al. [29]. Low-quality sequencing bases at both ends were clipped out with an error probability limit of 0.05. Then, the Map to Reference function was used to map filtered reads which excludes nuclear and mitochondrial reads to published plastid genomes of S. babylonica L. (MF189167) [30], S. chaenomeloides Kimura (MG262362), S. rehderiana C.K.Schneid. (MG262367), S. rorida (MG262368), and S. taoensis Goerz ex Rehder & Kobuski (MG262369), which were set as references (iterate up to 10 times). The subsequent chloroplast reads were used for de novo assembly of the contigs with a mediumlow sensitivity setting. Usually, there is only one contig (approximately 130 kb) that results from de novo assembly. The filtered reads were repeatedly mapped to the contigs to increase length; if more than one smaller contig was obtained, overlapping regions were used to enable all the contigs to be concatenated to a 130 kb contig with four junctions between two single-copy and IRs regions being confirmed. The Find Repeats Plug-in function was used to define the IR regions, which were copied and inverted manually to construct complete chloroplast genome sequences.
Plastid Genome Annotator (PGA) was used to perform initial annotations of complete chloroplast genome [31] with Salix suchowensis [32] as a reference and manually corrected with Plann 1.1.2 [33] and Geneious [28]. Finally, the circular graphical map of the cp genomes was drawn by the OGDRAW online tool [34].

Repeat Sequence and Codon Usage Analysis
We applied MISA-web to predict repeat sequences of chloroplast microsatellites (cpSSRs) [35] with parameters set to: minimum number of eight repeated motifs for mononucleotide, five repeated motifs for dinucleotide, four repeated motifs for trinucleotide, and three repeated motifs for tetra-and pentanucleotide repeats. Online REPuter software was employed to discover interspersed repeats in whole cp genomes, which included forward, reverse, complement, and palindromic repeats [36] with a minimum repeat size of 30 bp and the Hamming distance set to three. We used Geneious 10.2 [28] to calculate the percent of guanine-cytosine (GC). The relative synonymous codon usage (RSCU) determining the preference for the use of a codon was generated using MEGA 7 [37] based on 85 protein-coding genes (PCG) sequences. An RSCU not greater than 1.0 means no preference, while an RSCU greater than 1 means preference.

Genome Comparison and Sequence Divergence
Three willow cp genomes were globally compared with the Shuffle-LAGAN mode in online mVISTA software with default settings [38] by employing the S. spathulifolia as an annotation reference. Moreover, we conducted a DNA polymorphism analysis to detect nucleotide diversity (π) with a 100 bp window size and a 25 bp step size by using DnaSP  [39]. IRscope online [40] was used to analyze and visualize the borders between LSC/IRs and SSC/IRs among the three willow cp genomes.

Phylogenetic Research
The maximum likelihood (ML) phylogeny was applied to investigate the phylogenetic status of S. spathulifolia, S. annulifera, and S. cupularis. The cp genome sequences of 18 Salix and two Populus species were obtained from the NCBI genome database (National Center for Biotechnology Information). After manually checking the region direction, a total of 20 cp genomes along with three target willows were ultimately aligned using MAFFT online version [41] and then by Phylosuite V1.2.2 [42]. The ML tree was estimated using the best nucleotide replacement model GTR + F + I estimated by ModelFinder [43] implemented in Phylosuite [42]. The two Populus species were set as outgroups. We used bootstrap replicates of 5000 to test the confidence of each branch.

Genome Features
Full-length complete cp genomes of the three Salix species were accomplished varying from 155,566 to 155,680 bp and the genome structure comprised two IR regions (range from 27,453 to 27,461 bp), separated by an LSC region (84,431-84,552 bp) and an SSC region (16,221 bp) ( Figure 1, Table 2). The overall GC content for all three willow species was about 36.7%, while IR regions displayed a higher GC content (41.9%) than the GC content of LSC (34.4%) and SSC (31.0%). This GC content distribution pattern seemed to be common in other plants [44], which is caused by a relatively high GC content in rRNA and tRNA genes [45], occupying more area than that of protein-coding genes in IR regions.    In each of the three cp genomes, 130 genes were predicted (85 protein-coding genes, 37 tRNAs, and 8 rRNAs), of which 19, 20, and 21 genes were duplicated in IR regions, respectively ( Table 2). Seventeen genes containing introns were identified, which was the same as in the previous study [30].
According to the genes' functions, they can be divided into 17 groups, of which 57 genes were found to be related to self-replication: 11, 8, 4, 4, 30 genes were responsible for encoding small ribosomal subunit protein, large ribosomal subunit protein, and RNA polymerase subunits, rRNAs, and tRNAs respectively. Another category contained 46 genes related to photosynthesis, involving 7 photosystem I genes, 15 photosystem II genes, 6 cytochrome b/f complex genes, 6 ATP synthase genes, 11 ndh genes, and 1 large subunit Rubisco gene (Table 3). Table 3. Gene composition in the chloroplast genomes of S. spathulifolia, S. cupularis, and S. annulifera.

Codon Usage and Repeat Sequence Analysis
A total of 26,115, 26,131, and 26,138 codons were identified in the cp genomes of S. cupularis, S. annulifera, and S. spathulifolia, respectively. The number of codons encoding leucine was the largest and the least abundant were those for cysteine, which was the same as for the other two species. Across the three genomes, there were 30 preferred synonymous codons (RSCU > 1); except for the UUG codon ending with G, the others ended with A or U. Moreover, the use of the two codons AUG and UGG had no preference for codon usage (RSCU = 1) ( Table 5).  Using MISA-web, we detected 222 SSRs in the three cp genomes, of which the most were mono-and tetra-nucleotide types. In S. spathulifolia, the most abundant SSR type was mononucleotide repeats, with a total number of 197, followed by tetranucleotide (12), dinucleotide (10), and the least abundant were trinucleotide (1), pentanucleotide (1), and hexanucleotide (1) (Table S1 and Figure 2a). The numbers of mononucleotides in S. cupularis and S. annulifera were 199 and 196, respectively; however, the numbers of trinucleotide and pentanucleotide repeats were identical. Variations in the dinucleotide, tetranucleotide, and hexanucleotide repeats were no more than one. No hexanucleotide repeats were identified in the S. cupularis cp genome. In terms of base composition, among the three cp genomes, the most common type of SSRs was A/T type mononucleotide, accounting for 82.88-84.23% (Figure 2b). The longest SSR had a length of 18 bp while the shortest was 8 bp (Table S1). Comparison among the three willows showed high similarity in SSR type and quantity.  (Table S2). Comparatively, S. annulifera had the smallest number of repeats both in total (34) and in each group (21 forward, 11 palindromic, 1 reverse, 1 complement). The four repeat types were rank-order arranged according to the number of repeats of each type in three willows: Forward (F)-  (Table S2). Comparatively, S. annulifera had the smallest number of repeats both in total (34) and in each group (21 forward, 11 palindromic, 1 reverse, 1 complement). The four repeat types were rankorder arranged according to the number of repeats of each type in three willows: Forward (F)-most; Palindromic (P)-second; Reverse (R)-third; and Complement (C)-least. In addition, no complement repeats were found in the cp genome of S. cupularis. most; Palindromic (P)-second; Reverse (R)-third; and Complement (C)-least. In addition, no complement repeats were found in the cp genome of S. cupularis.

Genome Comparison
The chloroplast genomes of the three Salix taxa were compared and analyzed using mVISTA online software with S. spathulifolia as a reference. Overall, the comparative genomic analysis showed that the three willows' cp genome sequences were relatively conserved. Generally, higher variation was found in non-coding regions. The most divergent regions in the coding sequences were psbL, rpl16, and ycf1. The regions with relatively higher variation were mainly concentrated in the LSC and SSC regions; however, the IR regions were more conserved (Figure 4). Similar results can be revealed by a sliding window analysis, which determines the nucleotide diversity in chloroplasts. The average Pi of the three chloroplast genomes was 0.00075, and high nucleotide variability (π) was exhibited at the SC regions in comparison to IR regions ( Figure 5). Due to the difficulty in resolving the phylogeny of the genus Salix, these divergent regions-with the highest variation in Salix cp genomes-can be identified as a source of potential molecular markers.
The four junctions in the three willows' cp genomes were presented using IRscope ( Figure 6). Overall, the distribution of genes on each border was almost identical, with only slight differences among the three willows. The LSC/IRb border was inside the rpl22 gene in all three willows. The 347 bp fragment of the rpl22 is located within the LSC region, whereas the remaining 52 bp section of this gene can be found within the IRb region. The IRb/SSC boundary is located inside the ndhF gene. The 10 bp fragment of the ndhF gene is located within the IRb region, whereas the remaining 2264 bp section of this gene can be found within the SSC region. The ycf1 gene is located across the SSC/IRa junction. Compared to S. annulifera, the parts of the ycf1 gene in the SSC of S. cupularis and S. spathulifolia were 6 bp shorter, whereas the parts of the ycf1 gene in the IRa region were the same (1748 bp). The rps19 gene was fully located in the IRa, while the trnH gene was fully located in the LSC, 1 bp away from the IRa/LSC border.

Genome Comparison
The chloroplast genomes of the three Salix taxa were compared and analyzed using mVISTA online software with S. spathulifolia as a reference. Overall, the comparative genomic analysis showed that the three willows' cp genome sequences were relatively conserved. Generally, higher variation was found in non-coding regions. The most divergent regions in the coding sequences were psbL, rpl16, and ycf1. The regions with relatively higher variation were mainly concentrated in the LSC and SSC regions; however, the IR regions were more conserved (Figure 4). Similar results can be revealed by a sliding window analysis, which determines the nucleotide diversity in chloroplasts. The average Pi of the three chloroplast genomes was 0.00075, and high nucleotide variability (π) was exhibited at the SC regions in comparison to IR regions ( Figure 5). Due to the difficulty in resolving the phylogeny of the genus Salix, these divergent regions-with the highest variation in Salix cp genomes-can be identified as a source of potential molecular markers.
The four junctions in the three willows' cp genomes were presented using IRscope ( Figure 6). Overall, the distribution of genes on each border was almost identical, with only slight differences among the three willows. The LSC/IRb border was inside the rpl22 gene in all three willows. The 347 bp fragment of the rpl22 is located within the LSC region, whereas the remaining 52 bp section of this gene can be found within the IRb region. The IRb/SSC boundary is located inside the ndhF gene. The 10 bp fragment of the ndhF gene is located within the IRb region, whereas the remaining 2264 bp section of this gene can be found within the SSC region. The ycf1 gene is located across the SSC/IRa junction. Compared to S. annulifera, the parts of the ycf1 gene in the SSC of S. cupularis and S. spathulifolia were 6 bp shorter, whereas the parts of the ycf1 gene in the IRa region were the same (1748 bp). The rps19 gene was fully located in the IRa, while the trnH gene was fully located in the LSC, 1 bp away from the IRa/LSC border.

Phylogenetic Analysis
To explore the phylogenetic status of these three willows, an ML phylogeny was built using the complete cp genomes of 20 Salicaceae species published on NCBI and three in our study ( Figure 7). As shown in Figure 7, all the available willows were evidently divided into two major groups, consistent with a previous study [49]. The three target willows were all in clade I. S. spathulifolia was at the base of clade I; S. annulifera was most closely to S. rorida, while S. cupularis was resolved as a separate branch and placed as a sister group to S. suchowensis and S. psammophila, which belonged to sect. Helix. The phylogenetic tree generated a total of 21 nodes, and bootstrap support was 100% in 11 out of 21 branch nodes.
However, in Li et al. (2021) the phylogenetic placement of Salix cupularis contradicted our result, which may be related to some reasons about constructing phylogenetic tree and origin of samples. Firstly, outgroup sampling was a crucial step in phylogenetic analyses, affecting homoplasy between ingroup and the outgroup and statistical error rate, thus, an outgroup with large genetic distance from the ingroup may share few relevant character states and took a long time to accumulate homoplasy, attach randomly to the ingroup, and bias ingroup relationships, and this was the reason why we selected genus Populus, the sister group of Salix to root the phylogenetic tree [50][51][52]. Meanwhile, methods of tree construction may have an effect on the phylogenetic results. The reconstruction methods of phylogeny can be divided into two categories: distance-based and sequence-based. The most popular distance method was the NJ method, which used the distance matrix as input to specify the distance between each pair of taxa, however, different from the NJ method, ML method was based on a clear sequence evolution model and likelihood function, one of the advantages of the ML method was that all its model

Phylogenetic Analysis
To explore the phylogenetic status of these three willows, an ML phylogeny was built using the complete cp genomes of 20 Salicaceae species published on NCBI and three in our study ( Figure 7). As shown in Figure 7, all the available willows were evidently divided into two major groups, consistent with a previous study [49]. The three target willows were all in clade I. S. spathulifolia was at the base of clade I; S. annulifera was most closely to S. rorida, while S. cupularis was resolved as a separate branch and placed as a sister group to S. suchowensis and S. psammophila, which belonged to sect. Helix. The phylogenetic tree generated a total of 21 nodes, and bootstrap support was 100% in 11 out of 21 branch nodes.
However, in Li et al. (2021) the phylogenetic placement of Salix cupularis contradicted our result, which may be related to some reasons about constructing phylogenetic tree and origin of samples. Firstly, outgroup sampling was a crucial step in phylogenetic analyses, affecting homoplasy between ingroup and the outgroup and statistical error rate, thus, an outgroup with large genetic distance from the ingroup may share few relevant character states and took a long time to accumulate homoplasy, attach randomly to the ingroup, and bias ingroup relationships, and this was the reason why we selected genus Populus, the sister group of Salix to root the phylogenetic tree [50][51][52]. Meanwhile, methods of tree construction may have an effect on the phylogenetic results. The reconstruction methods of phylogeny can be divided into two categories: distance-based and sequence-based. The most popular distance method was the NJ method, which used the distance matrix as input to specify the distance between each pair of taxa, however, different from the NJ method, ML method was based on a clear sequence evolution model and likelihood function, one of the advantages of the ML method was that all its model

Phylogenetic Analysis
To explore the phylogenetic status of these three willows, an ML phylogeny was built using the complete cp genomes of 20 Salicaceae species published on NCBI and three in our study ( Figure 7). As shown in Figure 7, all the available willows were evidently divided into two major groups, consistent with a previous study [49]. The three target willows were all in clade I. S. spathulifolia was at the base of clade I; S. annulifera was most closely to S. rorida, while S. cupularis was resolved as a separate branch and placed as a sister group to S. suchowensis and S. psammophila, which belonged to sect. Helix. The phylogenetic tree generated a total of 21 nodes, and bootstrap support was 100% in 11 out of 21 branch nodes.
However, in Li et al. (2021) the phylogenetic placement of Salix cupularis contradicted our result, which may be related to some reasons about constructing phylogenetic tree and origin of samples. Firstly, outgroup sampling was a crucial step in phylogenetic analyses, affecting homoplasy between ingroup and the outgroup and statistical error rate, thus, an outgroup with large genetic distance from the ingroup may share few relevant character states and took a long time to accumulate homoplasy, attach randomly to the ingroup, and bias ingroup relationships, and this was the reason why we selected genus Populus, the sister group of Salix to root the phylogenetic tree [50][51][52]. Meanwhile, methods of tree construction may have an effect on the phylogenetic results. The reconstruction methods of phylogeny can be divided into two categories: distance-based and sequence-based. The most popular distance method was the NJ method, which used the distance matrix as input to specify the distance between each pair of taxa, however, different from the NJ method, ML method was based on a clear sequence evolution model and likelihood function, one of the advantages of the ML method was that all its model assumptions are clear, so they can be evaluated and improved, generally speaking, if the model was appropriate, ML method outperformed other methods [53]. Furthermore, considering the considerable intraspecific genotypical polymorphisms, extensive interspecific hybridization, and chloroplast capture result from reduced sterility barriers among taxa, establishing phylogeny of the genus Salix was difficult [54]. Our sample was taken from the vicinity of the type locality, which was some distance from the sampling sites mentioned in Li et al. [24]. We inferred that this may also be one of the reasons for the inconsistent phylogenetic results.
assumptions are clear, so they can be evaluated and improved, generally speaking, if the model was appropriate, ML method outperformed other methods [53]. Furthermore, considering the considerable intraspecific genotypical polymorphisms, extensive interspecific hybridization, and chloroplast capture result from reduced sterility barriers among taxa, establishing phylogeny of the genus Salix was difficult [54]. Our sample was taken from the vicinity of the type locality, which was some distance from the sampling sites mentioned in Li et al. [24]. We inferred that this may also be one of the reasons for the inconsistent phylogenetic results.

Conclusions
In this study, the whole chloroplast genomes of S. annulifera, S. cupularis, and S. spathulifolia were obtained by next-generation sequencing. Our results showed that the genomes ranged from 155,566 to 155,680 bp in length, encoding 130 genes. Comparison among these three willows' cp genomes showed similarity in their structure and organization. Phylogenetic analysis revealed that Salix spathulifolia, S. cupularis, and S. annulifera could be found within one clade among all studied Salix species. S. spathulifolia was located at the base of the clade, S. annulifera was sister to S. rorida, and S. cupularis was closest to S. suchowensis and S. psammophila. The complete cp genomes of these three willows may

Conclusions
In this study, the whole chloroplast genomes of S. annulifera, S. cupularis, and S. spathulifolia were obtained by next-generation sequencing. Our results showed that the genomes ranged from 155,566 to 155,680 bp in length, encoding 130 genes. Comparison among these three willows' cp genomes showed similarity in their structure and organization. Phylogenetic analysis revealed that Salix spathulifolia, S. cupularis, and S. annulifera could be found within one clade among all studied Salix species. S. spathulifolia was located at the base of the clade, S. annulifera was sister to S. rorida, and S. cupularis was closest to S. suchowensis and S. psammophila. The complete cp genomes of these three willows may provide an opportunity for further molecular phylogenetic and evolutionary studies in Salicaceae.