A Circular Chloroplast Genome of Fagus sylvatica Reveals High Conservation between Two Individuals from Germany and One Individual from Poland and an Alternate Direction of the Small Single-Copy Region

Abstract: Chloroplast genomes are difficult to assemble because of the presence of large inverted repeats. At the same time, correct assemblies are important, as chloroplast loci are frequently used for biogeography and population genetics studies. In an attempt to elucidate the orientation of the single-copy regions and to find suitable loci for chloroplast single nucleotide polymorphism (SNP)-based studies, circular chloroplast sequences for the ultra-centenary reference individual of European beech (Fagus sylvatica), Bhaga, and an additional Polish individual (named Jamy) were obtained based on hybrid assemblies. The chloroplast genome of Bhaga was 158,458 bp long, and that of Jamy was 158,462 bp long. Using long-read mapping on the configuration inferred in this study and the one suggested in a previous study, we found an inverted orientation of the small single-copy region. Compared to the previously published genome, the chloroplast genomes of Bhaga and of the individual from Poland each have only two mismatches, as well as three and two indels, respectively. The low divergence suggests low seed dispersal but high pollen dispersal. However, once chloroplast genomes become available from Pleistocene refugia, where a high degree of variation has been reported, they might prove useful for tracing the migration history of Fagus sylvatica in the Holocene.


Introduction
The European beech is a dominant species in Germany and other parts of Central Europe [1]. It is valued for its wood and provides crucial resources for a number of species that depend on it for survival [2]. Due to climate change, the European beech has recently experienced drought stress in several areas [3,4], which probably marks a transition to oak (Quercus robur s.l.) in drier regions of Germany [5]. However, succession from oak- to beech-dominated forest stands is also observed, in particular at the north-eastern distribution limits of beech [6]. Several population genetics studies are available for European beech [7–12], and some of them focus on chloroplast-based genetic diversity [9,13–16]. However, analyses considering SNPs in the complete chloroplast genome are so far lacking, even though such a dataset might help in understanding maternal gene flow within local stands, as well as population dynamics and phylogeography in conjunction with past and present climate change [17,18].
Potential single nucleotide polymorphisms (SNPs) within the cpDNA of beech were recently investigated with the aid of RADseq methods [16]; however, the identification of reliable SNPs is challenging due to the presence of cpDNA sequences in the nuclear genomes of several plants. Therefore, using complete chloroplast genomes is the most comprehensive way to assess cpDNA diversity within the species, allowing for the detection of all kinds of variation, including SNPs, indels, and even structural variants. Recently, a complete chloroplast genome sequence for European beech was published [19], but the variability of the complete chloroplast genome was not evaluated. Therefore, it was the aim of this study to provide two additional chloroplast sequences from parts of the beech distribution area that were glaciated during the ice ages and are known for their low chloroplast haplotype diversity [9,10,18], with the objective of searching for highly variable regions suitable for population genetics, as previously done for East Asian beech [20]. For this, the ultra-centenary reference individual Bhaga from the National Park Kellerwald-Edersee (Germany) and an individual from the Jamy Nature Reserve (Poland), named Jamy, were used, and their chloroplast genome sequences were compared to the chloroplast sequence recently published for another German individual [19].

Initial Draft Assembly
To assemble the chloroplast genome of the reference individual of European beech, Bhaga [21], from the National Park Kellerwald-Edersee, genomic short (Illumina) and long (PacBio) reads were used. The PacBio reads were corrected twice, first with Illumina reads using proovread [22] and then by self-correction using Canu [23]. The PacBio reads were then searched with BLAST against the completely assembled chloroplast genomes of Arabidopsis thaliana [24], Eucalyptus grandis [25], and Fagus sylvatica [19]. Three individual BLAST searches were conducted on the PacBio reads at query length match cut-offs of 50%, 70%, and 90%, with default values for all other BLAST parameters. The reads obtained at each cut-off level were used separately to calculate chloroplast assemblies using Canu. All three assemblies resulted in a longest contig of the same length, while several small contigs differed depending on the dataset used. However, in none of the cases was the longest contig, which represents the chloroplast genome, circular; it was therefore carefully refined with additional analyses (see below) to obtain a circular chloroplast genome.
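The read selection at the three cut-off levels can be illustrated with a minimal Python sketch. It is not the procedure used in this study: it assumes that "query length match" means the fraction of a read covered by a single BLAST hit, and that the BLAST results are available as tab-separated lines holding the 12 standard tabular columns plus the query length (qlen) as a 13th column. The function names and the toy hits are hypothetical.

```python
def query_coverage(qstart: int, qend: int, qlen: int) -> float:
    """Fraction of the query (read) covered by one alignment."""
    return (abs(qend - qstart) + 1) / qlen


def reads_passing(blast_lines, min_cov):
    """Return the IDs of reads with at least one hit covering >= min_cov of the read.

    Assumed input: tab-separated lines with the 12 standard tabular BLAST
    columns followed by qlen (query length) as column 13.
    """
    keep = set()
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        qseqid = fields[0]
        qstart, qend, qlen = int(fields[6]), int(fields[7]), int(fields[12])
        if query_coverage(qstart, qend, qlen) >= min_cov:
            keep.add(qseqid)
    return keep


# Toy hits (hypothetical): three reads of 1000 bases each, covered to
# 95%, 60%, and 40% of their length by a chloroplast hit.
example = [
    "read1\tcp\t98.0\t950\t19\t0\t1\t950\t100\t1049\t0.0\t1500\t1000",
    "read2\tcp\t97.0\t600\t18\t0\t1\t600\t100\t699\t0.0\t1000\t1000",
    "read3\tcp\t99.0\t400\t4\t0\t101\t500\t100\t499\t0.0\t700\t1000",
]

# One read subset per cut-off level, as input for three separate assemblies.
subsets = {cov: reads_passing(example, cov) for cov in (0.5, 0.7, 0.9)}
for cov, ids in subsets.items():
    print(f"cut-off {cov:.0%}: {sorted(ids)}")
```

A stricter cut-off keeps fewer but more reliably chloroplast-derived reads, which is one plausible reason why the small contigs differed between the three datasets while the longest contig did not.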
The chloroplast genome of the Polish individual (named Jamy), originating from the Jamy Nature Reserve (18°56′6.07″ E, 53°35′9.67″ N), was assembled from 15,718,780 paired-end (150 bp) Illumina reads using the de novo assembler NOVOPlasty v4.2 [26], with the sequence under GenBank accession number AY042393.1 [27] as seed and the Bhaga chloroplast genome as the reference for guiding the assembler in the inverted repeat regions. This resulted in a 158,462 bp circularised assembly with an average coverage of 481×. The final chloroplast sequences have been deposited under the GenBank accession numbers MW531753 for Bhaga and MW537046 for Jamy.

Manual Curation
Illumina paired-end reads were used to check the consistency of the coverage of the reference assembly, as a lack of coverage or coverage deviating strongly from the average would indicate misassembled or duplicated regions. Subsequently, a self-BLAST of the sequence was performed to identify the inverted repeat regions. The assembled sequence was cut at the points of potential misassembly (i.e., regions of lower coverage and the borders to the repeat regions) and subsequently re-joined using bridging PacBio reads. Finally, PacBio reads were aligned to the final sequence using BLASR [28] to verify coverage continuity. At this stage, the sequence was circularised with evidence from long-read mapping. The accuracy of the circular chloroplast sequence was further improved by polishing with Illumina paired-end reads using Pilon [29]. The chloroplast genome of the Polish individual was verified by alignment to the reference genome using MUMmer [30].
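The coverage consistency check described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the per-base depth values are taken as already computed from the read mapping, and the window size and the two thresholds (half and twice the average depth) are hypothetical choices, not values reported in this study.

```python
from statistics import mean


def flag_windows(depth, window=100, low=0.5, high=2.0):
    """Return (start, end, window_mean) for windows with unusual coverage.

    A window is flagged when its mean depth is below low * average or above
    high * average, where average is the mean depth over the whole sequence.
    The thresholds are illustrative assumptions.
    """
    avg = mean(depth)
    flagged = []
    for start in range(0, len(depth), window):
        chunk = depth[start:start + window]
        window_mean = mean(chunk)
        if window_mean < low * avg or window_mean > high * avg:
            flagged.append((start, start + len(chunk), window_mean))
    return flagged


# Toy depth profile (hypothetical): a coverage drop at positions 300-399 and
# a fourfold excess at positions 700-799, as a collapsed repeat might produce.
depth = [100] * 300 + [10] * 100 + [100] * 300 + [400] * 100
flagged = flag_windows(depth)
for start, end, window_mean in flagged:
    print(f"{start}-{end}: mean depth {window_mean}")
```

In this toy profile the two flagged windows mark the kind of positions at which a sequence would be cut and re-joined with bridging long reads.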

Annotation of Genic Regions
GeSeq ChloroBox [31] was used for annotation and graphical representation. For the annotation, the MPI-MP reference set of chloroplasts of land plants (CDS + rRNA), available from the ChloroBox server, was used. The graphical representation indicating the gene locations, the Large Single-Copy region (LSC), the Small Single-Copy region (SSC), and the two Inverted Repeat regions (IRA and IRB) was built using OGDraw [31].

Comparison of Chloroplast Assemblies
For the comparison, the two chloroplast sequences reported in this study, obtained from Bhaga (Germany) and Jamy (Poland), and the one previously published from another German individual (GenBank accession MK598696) [19] were aligned using MUMmer [32] and inspected for the occurrence of SNPs. To verify the inversion found for the small single-copy region as compared to the previously published genome, long reads were aligned to the chloroplast configuration published by Mader et al. [19] and to the one inferred in this study using Geneious R1 (Biomatters, Auckland, New Zealand).
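How mismatches and indels can be tallied from an alignment is shown in the following minimal Python sketch. It assumes that a gapped pairwise alignment is already available as two equal-length strings with "-" as the gap character, and it counts a run of consecutive gap columns in the same sequence as one indel event. The function name and the toy alignment are hypothetical; the comparison in this study was carried out with the tools named above.

```python
def count_differences(aln_a: str, aln_b: str):
    """Count mismatch columns and indel events in a gapped pairwise alignment.

    Consecutive gap columns in the same sequence count as one indel event.
    Returns a tuple (mismatches, indels).
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    mismatches = 0
    indels = 0
    prev_gap = None  # which sequence carried a gap in the previous column
    for a, b in zip(aln_a.upper(), aln_b.upper()):
        gap = "a" if a == "-" else "b" if b == "-" else None
        if gap is not None:
            if gap != prev_gap:
                indels += 1  # a new gap run starts here
        elif a != b:
            mismatches += 1
        prev_gap = gap
    return mismatches, indels


# Toy alignment (hypothetical): one substitution, a one-base gap in the
# second sequence, and a two-base gap in the first sequence.
toy_a = "ACGTTTACC--GA"
toy_b = "ACGATTA-CCCGA"
mismatches, indels = count_differences(toy_a, toy_b)
print(mismatches, indels)
```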

General Features of the Chloroplast Assembly of Bhaga
The circularised chloroplast genome of the Fagus sylvatica reference individual Bhaga was 158,458 bp long. It contains two inverted repeat regions (IRA and IRB in Figure 1) of 25,873 bp each, along with a Small Single-Copy region (SSC in Figure 1) of 19,010 bp and a Large Single-Copy region (LSC in Figure 1) of 87,702 bp. In our assembly, all four of these subunits are adjacent to each other and constitute a circular chloroplast genome. The functional annotation of the chloroplast genome revealed 88 protein-coding gene locations, of which 75 are single-exon genes, nine have two exons, three have three exons, and one has six exons. There are eight rRNA gene locations and 40 tRNA gene locations. Gene locations on both strands are shown in Figure 1.
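As a consistency check on the figures reported above, the lengths of the four subunits should add up to the total genome length, and the exon classes should add up to the number of protein-coding gene locations. The short Python sketch below performs this arithmetic using only the numbers given in the text; the variable names are illustrative.

```python
# Subunit lengths in bp, as reported for the Bhaga chloroplast genome.
regions = {"LSC": 87_702, "SSC": 19_010, "IRA": 25_873, "IRB": 25_873}
total = sum(regions.values())

# Protein-coding gene locations grouped by their number of exons.
genes_by_exons = {1: 75, 2: 9, 3: 3, 6: 1}
protein_coding = sum(genes_by_exons.values())

print(f"total length: {total} bp")
for name, length in regions.items():
    print(f"{name}: {length} bp ({length / total:.1%} of the genome)")
print(f"protein-coding gene locations: {protein_coding}")
```

Both sums agree with the reported values of 158,458 bp and 88 protein-coding gene locations.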

Comparison of Chloroplast Assemblies
In the earlier assembly of a Fagus sylvatica chloroplast genome from Germany (GenBank accession number MK598696) [19], there is a stretch of eight bases that is not assigned to any of the subunits. Alignment of the chloroplast genome published by Mader et al. [19] with the one from Bhaga revealed that the SSC regions of the previous assembly and ours match each other in inverse orientation, with one gap and one mismatch. The gap is within a polyA stretch in a non-genic region, and the mismatch is a C to T transition that does not change the amino acid in the protein product of the gene ccsA. The inversion found in our study was confirmed by long-read mapping (Figure 2). The remaining sequence stretches matched directly, with one mismatch and three deletions. The mismatch is a G to T transversion that does not change the amino acid in the protein product of the gene atpE. One gap of two bases lies in a C homopolymer at a non-genic location, and another gap lies in a T homopolymer within the rpl23 gene but does not cause an amino acid change.
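The notion of two regions matching in inverse orientation can be made concrete with a small Python sketch. A region that is flipped within a circular genome reads as the reverse complement of the unflipped one, so comparing a region against both the other region and its reverse complement indicates the relative orientation. The sketch assumes equal-length sequences without indels and uses a short invented sequence; it is not the alignment or long-read mapping procedure used in this study, and all names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def hamming(a: str, b: str) -> int:
    """Number of differing positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def best_orientation(region_a: str, region_b: str):
    """Return ('forward' or 'reverse', mismatches) for the better-matching orientation."""
    fwd = hamming(region_a, region_b)
    rev = hamming(region_a, reverse_complement(region_b))
    return ("forward", fwd) if fwd <= rev else ("reverse", rev)


# Toy example (invented sequence): region_b is region_a in inverse
# orientation, carrying one additional substitution.
ssc_a = "ATGGCCTTAGACCGTAAT"
ssc_b = reverse_complement(ssc_a)
ssc_b = ssc_b[:5] + ("A" if ssc_b[5] != "A" else "C") + ssc_b[6:]

orientation, mismatches = best_orientation(ssc_a, ssc_b)
print(orientation, mismatches)
```

A sequence comparison of this kind only shows that two assemblies differ in orientation; deciding which orientation is supported in a given individual requires reads that span the borders between the single-copy region and the inverted repeats.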
The chloroplast genome of the Jamy individual is identical in length to the chloroplast genome published by Mader et al. [19] and, in comparison to the Bhaga genome, showed only three indels, located in three intergenic regions: C between psaC and ndhD; TT between matK and tRNA-Gln; and T between the psbK and psbI genes.

Discussion
As one of the most important forest trees of central Europe, the European beech has been intensely studied throughout the past decades with respect to its ecology [33–35], phenology [36,37], and population genetics [11,38]. The latter studies are usually based on fingerprinting methods [18,38], but recently, SNP-based studies in beeches have been reported [11,12,16,39,40]. Chloroplast-based analyses are especially useful for studying maternal inheritance, which can inform on seed dispersal rates, phylogeography, migration, and the tracking of historic and current human-mediated seed transfer. Using manual curation to account for assembly errors, a circular chloroplast genome sequence of the Fagus sylvatica reference individual, Bhaga, from the German National Park Kellerwald-Edersee [21], could be obtained. A comparison to the chloroplast genome of a Polish beech individual, Jamy, and the recently published chloroplast genome of another German individual [19] revealed surprisingly low variation, with only two SNP and three indel positions. Interestingly, the order of the chloroplast elements of Bhaga was found to differ from that of the previously published chloroplast genome of beech (MK598696) [19]. However, the structure of our chloroplast genome has been verified by long-read alignment, providing evidence of its accuracy. Whether this reflects a genuine structural difference from the genome reported previously by Mader et al. [19] or an assembly artefact needs to be clarified by future studies. In contrast to the low variation in beech, several SNPs were encountered in the comparatively small trnL-F gene of the annual plant Microthlaspi erraticum in central Europe [41], probably reflecting a higher dispersal ability. The high uniformity of the chloroplast genome is in line with earlier studies [9,18], but contrasts with the significant variation observed in the nuclear genome of Fagus sylvatica in central Europe [11,35,42,43].
A reason for the low amount of chloroplast variation might be that central Europe was colonised primarily from small but quickly expanding refugial populations [9,18], resulting in a limited number of chloroplast haplotypes. As beech seeds are not frequently dispersed over long distances [44,45], the forerunners of the colonisation would have inherited the same chloroplast type during the postglacial colonisation of central Europe, while the non-maternally inherited genome content could diversify due to pollen influx. In line with the low genetic diversity in the chloroplast genomes, Meger et al. [16], using RADseq methods, identified only eight SNPs located in the beech chloroplast genome in a metapopulation dataset, which also included samples from the northern Carpathians.
Additional complete chloroplast genomes from a broad range of beech populations, including glacial refugia, will be key to testing the hypothesis of low seed dispersal rates and efficient gene flow by pollen dispersal, despite previous assumptions of rather low pollen dispersal rates [46]. If this hypothesis is confirmed and future studies find more variation in chloroplast genomes from refugia, the chloroplast genome might offer a high-resolution tool to track the migration of Fagus sylvatica in the Holocene.