Complete Chloroplast Genomes of Fagus sylvatica L. Reveal Sequence Conservation in the Inverted Repeat and the Presence of Allelic Variation in NUPTs

Growing amounts of genomic data and more efficient assembly tools advance organelle genomics at an unprecedented scale. Genomic resources are increasingly used for phylogenetic analyses of many plant species, but are less frequently used to investigate within-species variability and phylogeography. In this study, we investigated genetic diversity of Fagus sylvatica, an important broadleaved tree species of European forests, based on complete chloroplast genomes of 18 individuals sampled widely across the species distribution. Our results confirm the hypothesis of a low cpDNA diversity in European beech. The chloroplast genome size was remarkably stable (158,428 ± 37 bp). The polymorphic markers, 12 microsatellites (SSR), four SNPs and one indel, were found only in the single copy regions, while inverted repeat regions were monomorphic both in terms of length and sequence, suggesting highly efficient suppression of mutation. The within-individual analysis of polymorphisms showed >9k of markers which were proportionally present in gene and non-gene areas. However, an investigation of the frequency of alternate alleles revealed that the source of this diversity originated likely from nuclear-encoded plastome remnants (NUPTs). Phylogeographic and Mantel correlation analysis based on the complete chloroplast genomes exhibited clustering of individuals according to geographic distance in the first distance class, suggesting that the novel markers and in particular the cpSSRs could provide a more detailed picture of beech population structure in Central Europe.


Introduction
Chloroplasts not only play a key role in photosynthesis but also other metabolic processes of green plants [1]. The generally maternal inheritance of the chloroplast genome in Angiosperms and relatively conserved gene content and order has made chloroplast genomes a valuable resource for phylogenetic and evolutionary studies [2,3]. Plant chloroplast genomes are mostly between 120 kb and 160 kb in length and usually have a quadripartite circular structure comprising of two regions of inverted repeats A and B (IR-A/IR-B), separated by a large single-copy (LSC) region and a small single-copy (SSC) region [4]. Due to next-generation sequencing approaches, sequencing of chloroplast genomes for dozens or hundreds of individuals is now achievable through whole genome sequencing [5,6]. Chloroplast genome sequences have helped to elucidate the phylogenetic relationships and evolutionary history of many tree genera, including Acer [7], Prunus [8], Populus [9,10], Quercus [11,12] and Pinus [13]. With an increased availability of whole chloroplast sequences, numerous studies demonstrated the presence of variation among individuals within species, which includes SNPs, indels, inversions, translocations, copy number variations and also IR expansion, gene loss and intron retention [14]. The level of this variability is usually considered low, both in terms of composition, as well as in terms of the degree of variation within the different regions [14][15][16][17]. Interestingly, the individuals themselves have been reported to show variations of their chloroplast genomes through heteroplasmy-which can occur as a result of independent mutations or biparental inheritance of organelles in one organism [18]. While chloroplast genomes can be a good tool for phylogeographic analyses, such studies are currently limited to only few conifers [16,17].
European beech (Fagus sylvatica L.) is ecologically and economically one of the most important broadleaved tree species in Europe [19]. There are several molecular studies that evaluated genetic diversity and structure of European beech using chloroplast DNA [20][21][22][23][24][25][26][27]. Most of these studies showed a low level of chloroplast diversity and a rather homogeneous genetic structure in Central Europe, but none of them exploited the full potential resolution offered by complete chloroplast genomes. Thus, there is still lack of comparative analyses based on complete chloroplast genomes, which would allow to identify novel chloroplast polymorphisms and to detect genetic structure of European beech at a regional scale. Recently, Mishra and coauthors [26] evaluated three complete chloroplast genomes of beech from areas glaciated during the Weichselian glacial maximum and found a very low genetic variation with only two SNP and three indel positions. This raised the question of whether the low variation found was due to genetic empoverishment by founder effects at the leading edge of the recolonization or if chloroplast genetic diversity is generally low in European beech. To clarify this, there is a need to assess genetic diversity of complete chloroplast genome of European beech sampled from a wider range, which was the aim of the current study.
Here, we report 16 newly sequenced and assembled complete chloroplast genomes of F. sylvatica and perform comparative genomic analyses of the new sequences with the two recently published chloroplast genomes: Bhaga (MW531753) and Jamy (MW537046) [26]. The aim of our study was to identify potentially highly variable markers in chloroplast genome of F. sylvatica suitable for phylogeographic studies as a useful genetic resource for developing chloroplast-based genetic markers (SNPs and SSRs) for large-scale population studies.

DNA Isolation and Sequencing
Details regarding DNA isolation and sequencing of Bhaga and Jamy individuals are given in Mishra and coauthors [26]. The remaining 16 individuals representing a wide range of the species distribution were collected from Siemianice provenance trial [28]. Detailed geographic locations are presented in Table 1 and Figure S1. DNA was isolated from leaves with a GeneMATRIX Plant & Fungi DNA Purification Kit (EURx, Poland), after storing them in the dark for 48 h. Genomic library preparation and sequencing was done by an external service provider (IGA Technology Services s.r.l.) with Illumina HiSeq 250 device in 125-bp PE mode. The obtained reads were purified from adapters and trimmed with Trimmomatic [29]. Raw reads were deposited in SRA under the accession numbers listed in Table 1.

Chloroplast Genome Assemblies and Annotation
Methods describing assembly and annotation of chloroplast genomes of Bhaga and Jamy individuals are presented in Mishra and coauthors [26]. All the remaining 16 chloroplast genomes where generated using the same protocol: Illumina reads were used for de novo assembly using NOVOPlasty v 4.2. [30,31] with seed sequence NCBI: AY453092.1 [32] and Bhaga chloroplast genome for guiding the program in both inverse repeat regions. After assembling, the sequences were manually checked, in case of presence of ambiguous nucleotides manual curation was done with the assistance of reads mapped to a genome with bwa-mem [33] and visualization of the results in Tablet software [34]. Coding sequences and RNA elements annotation was done with GeSeq ChloroBox [35] using chloroplast genomes of F. crenata (NC_041252; [36]), F. engleriana (KX852398; [37]) F. japonica (MT762294; [38]) and F. sylvatica (NC_041437; [39]) as references. Postprocessed annotated genomes where uploaded to GenBank, for accession numbers see Table 1.

Assessment of Genome Variation
REPuter was employed to identify four types of large repeating sequences (reverse, forward, complement and palindromic) with a minimum repeat size of 30 bp, hamming distance equal to 3 maximum computed repeats set to 50 [40]. Identification of chloroplast simple sequence repeats (cpSSR) was done using MISA [41]. The minimum number of repeat units was set to eight, six, five, five, three, and three, for mononucleotides, dinucleotides, trinucleotides, tetranucleotides, pentanucleotides, and hexanucleotides, respectively. For assessment of variance between the 18 studied chloroplast genomes alignments were done with MAFFT v 7.450 [42], as implemented in Unipro UGENE [43]. After this variations among the genomes where highlighted using the Bhaga chloroplast genome as reference.

Detection of Heteroplasmy
Potential chloroplast genome variation within individuals (heteroplasmy) was assessed with mapping of reads of each individual with bwa [33] to extracted SSC, LSC and IR regions of the genomes. The marker calling was done with Freebayes [44] with 0.02 minor allele frequency (MAF) and depth of 200x thresholds for variant detection to avoid sequencing error [45]. To verify the origin of the markers reads were mapped to chromosome 10 of the Fagus sylvatica nuclear assembly [46].

Phylogenetic Analysis
Phylogenetic analysis was done with on the basis of the dataset of the 18 Fagus sylvatica assemblies to which the complete chloroplast genomes of F. crenata (NC_041252; [36]), F. japonica (MT762294; [38]) and F. engleriana (KX852398; [37]) before alignment as described above. Phylogenetic reconstructions were done using IQ-TREE with the GTR+I+R model [47][48][49] and 1000 bootstrap replicates. The resulting phylogram was edited with The phylogenetic distance matrix obtained from IQ-TREE was also used to test the relationship between genetic and geographic distances among individuals. The correlation was calculated with PASSaGE 2 [50] using Mantel's test [51] for a global assessment and Mantel's correlogram to search for significance within 10 equally paired distance classes (the largest class excluded). All tests were performed with 10,000 permutations.

Assembly Size Variance and Genome Annotation
Read coverage of each of 16 new assemblies varied from 86x to 625x with average value of 302x and 284x median. Chloroplast genome structure was stable throughout the studied sequences, and assembly sizes varied from 158,391 bp to 158,464 bp, with highest length variance observed in the large single copy region (LSC) (87,634-87,706 bp). The small single copy regions (SSC) differed only by 4 bp (19,010-19,013 bp), while both inverted repeat regions (IR-A/IR-B) where identical in length (25,873 bp) ( Table 2). Similarly to the previously published F. sylvatica chloroplast assemblies [26,39] each of the 16 new genomes had an identical set of 131 annotated elements: 83 protein coding genes, 8 rRNAs and 40 tRNAs. Total share of coding elements differed across main elements of the genome, with 51% in LSC, 72.1% in SSC and 59.1% in IR regions.

Repeat elements and SNPs
Large repeating sequence (LRS) assessment showed that sixteen genomes had 32 LRSs >30 bp: 16 forward, 13 reverse, one complement and two palindromic. Two assemblies (Gdańsk and Glorup) had 31 LRSs as a result of the loss of one palindromic match, and one chloroplast genome (Ehingen) had 33 LRSs with an additional reverse match compared to the 16 previously mentioned assemblies. Analysis of cpSSR using MISA detected a total number of 138 markers, out of which only 4 of 97 mononucleotide cpSSRs and 8 of 35 complex cpSSRs were polymorphic. All discovered dinucleotide and pentanuclotide cpSSRs were found to be monomorphic (Table 3). In the SSC region we found four polymorphic markers: two mononucleotide and two complex repeats. Variation of the SSR mononucleotide T at position 12,538 occurred due to the absence of a microsatellite in one of the individuals (Fantanelle). The LSC region had eight polymorphic markers with six complex and two mononucleotide cpSSR (Table 4). Marker ratio, reflecting the number of individuals associated with a particular variant, showed that in eight sites an alternative nucleotide was present, while three cpSSR loci had two, and one locus had five variants.   (Table 5).

Within Individual Polymorphisms
Within individual polymorphism related to chloroplast genome was found in all 16 tested individuals. After filtering with and minimum depth >200x and MAF of 0,02 a total number of 9028 markers where detected in all analyzed regions.
The average depth of each base at a variant position was lower in single copy elements 349x and 360x in LSC and SSC respectively, while both IR regions had 477x depth (Table 6). However, the average alternative variant's depth was very similar across the main genome regions from 16.1x in SSC, 18.4-18.5x in IRs and 18.7x in LSC. This suggests that these variants represent the nuclear encoded plastome sequences (NUPTs), as the these values are similar to the average coverage of 17x found on chromosome 10 of the complete nuclear genome. This chromosome was selected for comparative analysis due to lowest NUPT detection in the whole assembly. Among the variant positions majority of them where SNPs (76.8-84.1%), the remaining share was associated to indels (8.8-10.2%), complex markers (3.1-8.2%) and MNPs (0.2-0.7%). In 2.7-4.6% of sites a mix of variants was detected e.g., a SNP and an indel called at a specific position in the same individual.
Markers detected in this study where found both in coding (48.6-67.3%) and noncoding areas (32.7-51.4%). The size of the contribution in each of these parts was related to the size of coding and non-coding regions of the main genome element (Figure 1).

Phylogenetic Analysis
The complete chloroplast genome sequences of 18 F. sylvatica individuals, as well as F. crenata, F. japonica and F. engleriana, were used to for a phylogenetic reconstruction based on maximum likelihood method and 1000 bootstrap replicates (Figure 2).

Phylogenetic Analysis
The complete chloroplast genome sequences of 18 F. sylvatica individuals, as well as F. crenata, F. japonica and F. engleriana, were used to for a phylogenetic reconstruction based on maximum likelihood method and 1000 bootstrap replicates ( Figure 2).

Phylogenetic Analysis
The complete chloroplast genome sequences of 18 F. sylvatica individuals, as well as F. crenata, F. japonica and F. engleriana, were used to for a phylogenetic reconstruction based on maximum likelihood method and 1000 bootstrap replicates ( Figure 2).  While a global Mantel's test did not reveal significant relationship (r = 0.2190; p = 0.1861) between the phylogenetic and geographic distances for the 18 individuals, clustering of similar individuals was confirmed with Mantel's correlogram within the first distance class (<250 km; r = 0.286; p = 0.011) ( Table 7).

Discussion
Complete chloroplast genomes have helped to reveal species relationships [9,11,12], but also allow to measure divergence within populations [16,17]. Growing genomic resources for European beech provide tools to extend our knowledge on this critically important forest tree species. Our results support previous hypotheses suggesting low genetic diversity of the beech plastome [23,52,53].
The total genome length variation (158,428 ± 37 bp) and the presence of polymorphisms were associated exclusively with Single Copy (SC) dions, while the pairs of IR regions where monomorphic both in terms of the length (25,873 bp) and in terms of nucleotide sequence. Stability of IR and variability of SC regions was also present in sequences of F.japonica (MT762294; MT762295), the results suggest a powerful gene conversion mechanism in Fagus species. Our study revealed 138 cpSSR in F. sylvatica out of which 126 where monomorphic. This group included universal cpSSRs: ccmp4, ccmp7, ccmp10, commonly used to assess phylogeny and relationship in eudicot species [54,55]. Magri and coauthors [23] using these markers concluded that Central Europe beech populations generally can be considered as a homogeneous group. The 12 polymorphic microsatellite markers discovered in this study, when applied for a higher number of individuals and populations, could, however, potentially provide a more detailed phylogeographic picture. Our phylogeographic analysis supports this assumption due to significant clustering of individuals over a relatively short distance <250 km.
Additional source of variation was found in in SNPs (4) and indel (1), all located in noncoding regions, but their position is not in line with results obtained based on reduced representation genomic libraries presented by Meger and coauthors [27]. Heteroplasmy is well reported in plants with known biparental inheritance of chloroplasts, even though in some species (e.g., Passiflora) it can occur at the seedling and embryo but not at the mature developmental stages [56]. In beech, due to maternal inheritance of organelles [3], heteroplasmy can exclusively be caused by mutations. The evidence of multiple integrations of organelle DNA integration with the nuclear genome in beech [46] and detection of withinindividual polymorphisms of cpDNA-related sequences presented in this study suggest that assessing beech diversity with chloroplast related SNPs due to a large occurrence of nuclear encoded of plastid DNA (NUPTs) can lead to uncertain results and should be taken with caution [57].