Generation of Chloroplast Molecular Markers to Differentiate Sophora toromiro and Its Hybrids as a First Approach to Its Reintroduction in Rapa Nui (Easter Island)

Sophora toromiro is an endemic tree of Rapa Nui with religious and cultural relevance that despite being extinct in the wild, still persists in botanical gardens and private collections around the world. The authenticity of some toromiro trees has been questioned because the similarities among hybrid lines leads to misclassification of the species. The conservation program of toromiro has the objective of its reinsertion into Rapa Nui, but it requires the exact genotyping and certification of the selected plants in order to efficiently reintroduce the species. In this study, we present for the first time the complete chloroplast genome of S. toromiro and four other Sophora specimens, which were sequenced de-novo and assembled after mapping the raw reads to a chloroplast database. The length of the chloroplast genomes ranges from 154,239 to 154,473 bp. A total of 130–143 simple sequence repeats (SSR) loci and 577 single nucleotide polymorphisms (SNPs) were identified.


Introduction
Sophora toromiro is an emblematic endemic small tree from Rapa Nui (Easter Island), and even though it had a crucial importance in the history of the island, it has been extinct in the wild since 1960 according to the International Union for Conservation of Nature (IUCN) [1]. Rapa Nui, a tiny land mass with an area of 166 km 2 , suffered the complete transformation of its terrestrial ecology, so that virtually no natural habitat survives [2]. In the beginning, the islanders used toromiro wood to craft ritual totems such as "Moai kava kava" and some other items with high religious importance [3]. The changes of Rapa Nui ecology show that the vegetal resources could not sustain the large growing population, so the islanders slashed and burned the native forest in order to build new houses and grow new crops to feed the people [4]. Also, animal grazing contributed to eradicate the last specimens of the island [5], except the one located inside Rano Kau crater that survived until about 1960 because of its difficult access. Nowadays S. toromiro can only be found in botanic gardens around the globe, such as the National Botanic Garden of Viña del Mar in Chile which holds the most well-documented archives and progenies of the last direct S. toromiro descendant from the island. SNPs on the other hand represents the substitution of a single nucleotide in a positionspecific manner within the genome, with applications ranging from resolving a population structure to barcoding generation to correctly differentiate species, among many others [35][36][37]. A major difference between SSRs and SNPs, is that the former can take multiple forms while the latter is mostly bi-allelic, however, the sheer density of SNPs markers that can be found in the genome of any plant or animal, along with is global distribution makes this marker one of the most useful to compare the chloroplast [38][39][40].
For all the reasons mentioned above, we decided to address a high-resolution strategy such as massive parallel sequencing, subsequently generating a bioinformatic pipeline to obtain robust and accurate results as a method to characterize and differentiate the remaining species of S. toromiro and its hybrids to help the reintroduction of this symbolic plant to Rapa Nui.
The aim of this study is to find a genetic tool that allows us to differentiate toromiro from its hybrids using genomic information based on the chloroplast sequences, markers distribution and variability inside this organelle.

Chloroplast Database Mapping
Our raw reads were mapped to a reference chloroplast database consisting of 2123 complete chloroplast genomes. The 5 samples had 5.511% of average alignment (2.73 million reads) with the chloroplast database (Table 1), corresponding to the sequences aligned once and more than once. The chloroplast database mapping results showed an alignment percentage within thẽ 5% expected range [41]. Table 1. Alignments, assembly and coverage. The average chloroplast Guanine + Cytosine content (GC) was obtained in the MIRA assembly report.
The genes shown in Table 3 were organized into a schematic map ( Figure 1) according to their real genomic locations. The genes on the outside circle are transcribed counterclockwise, and those on the inside are transcribed clockwise. A color is assigned to each gene, corresponding to their functional group. The grey inner circle represents the GC percentage and it also delimits the 4 regions of the chloroplast (IRa, IRb, SSC, LSC). These borders are shown in detail in Figure 2, where all 5 samples are aligned according to the quadripartite chloroplast structure. We show the relation of the borders with its closest gene. There are slight base pairs variations, but they are virtually the same.

Phylogenomic Analysis
A multiple sequence alignment of every Sophora sample and nine other plants of the Papilionoideae subfamily (described in methods) was performed using the complete chloroplast genome sequences. A. thaliana was included so that it could be used as an outgroup and a reference to the location of the tree root. The tree was represented as a dendrogram, in which the length of the branches is not representative of the evolutionary distance. 1000 bootstrap repetitions were performed to assess nodal support. The numbers located besides divergences ( Figure 3) correspond to the bootstrap support of each clade, which indicate the percentage of bootstrap replicates where the grouping occurs.
It is observed that only 2 clades of the phylogenomic tree have low bootstrap support (<70%, 1000 repetitions), corresponding to the divergence of the hybrid lines. ST1, phenotypically identified as S. toromiro, grouped together with ST-hyb1 and ST-hyb2 indicating a great similarity between the chloroplast genomes of these individuals. On the other hand, ST2 diverges closer to the root. These results, together with the phenotypic characterization, suggest that ST2 could correspond to a genuine toromiro. The SM sample, Sophora macrocarpa, groups close to the samples that we think could correspond to toromiro (ST2) due to its great similarity and geographic proximity.  Additionally, we aligned the psbA/trnH spacer sequence of our samples with the Sophora sequences presented by Shepherd and Heenan [17] in which they found that two genuine S. toromiro samples, obtained from herbaria, presented a 3 bp insertion. As expected, four of our samples, ST1, ST2, ST-hyb1 and ST-hyb2, presented the insertion (AAA) in the psbA/trnH spacer excluding SM which belongs to another Sophora species ( Figure S1). Assuming that maternal inheritance of the chloroplast DNA is occurring in all our samples, the presence of this insertion would indicate that four of them have a S. toromiro parent. Considering this, we would have expected to observe less variation between the chloroplast genomes of our four samples that present the 3 bp insertion, because all existing toromiros descend from a very reduced population. We suggest that paternal inheritance [49,50] may be playing a role in the genetic variability exhibited in the chloroplast genomes. These results show that the use of the complete chloroplast genome can be used as a genetic tool to aid in identifying toromiro hybrids because it can identify the chloroplast donor.

Chloroplast Genome Comparison
CgView is a comparative software for circular visualization of genomes that displays information in concentric rings by aligning the queries in a similarity order. The software uses the basic local alignment search tool (BLAST) to perform large-scale comparative analyses, showing the similarities and differences between species. In the outermost part of the circular graph, the genes of the sample used as query (ST2) are indicated, while the subject samples are in the concentric ring from the outside to the inside according to their similarity. Genetic features are mapped into different colors according to its similarities ( Figure 4). The black color represents 100% identity between the genomic sequences analyzed. The spectrum of red colors represents identities greater than 90%. The spectrum of blue colors represents identity values of less than 90%. The white color indicates that the sequences compared have 0% identity, so there is no similarity between them. Additionally, we aligned the psbA/trnH spacer sequence of our samples with the Sophora sequences presented by Shepherd and Heenan [17] in which they found that two genuine S. toromiro samples, obtained from herbaria, presented a 3 bp insertion. As expected, four of our samples, ST1, ST2, ST-hyb1 and ST-hyb2, presented the insertion (AAA) in the psbA/trnH spacer excluding SM which belongs to another Sophora species ( Figure S1). Assuming that maternal inheritance of the chloroplast DNA is occurring in all our samples, the presence of this insertion would indicate that four of them have a S. toromiro parent. Considering this, we would have expected to observe less variation between the chloroplast genomes of our four samples that present the 3 bp insertion, because all existing toromiros descend from a very reduced population. We suggest that paternal inheritance [49,50] may be playing a role in the genetic variability exhibited in the chloroplast genomes. These results show that the use of the complete chloroplast genome can be used as a genetic tool to aid in identifying toromiro hybrids because it can identify the chloroplast donor.

Chloroplast Genome Comparison
CgView is a comparative software for circular visualization of genomes that displays information in concentric rings by aligning the queries in a similarity order. The software uses the basic local alignment search tool (BLAST) to perform large-scale comparative analyses, showing the similarities and differences between species. In the outermost part of the circular graph, the genes of the sample used as query (ST2) are indicated, while the subject samples are in the concentric ring from the outside to the inside according to their similarity. Genetic features are mapped into different colors according to its similarities ( Figure 4). The black color represents 100% identity between the genomic sequences analyzed. The spectrum of red colors represents identities greater than 90%. The spectrum of blue colors represents identity values of less than 90%. The white color indicates that the sequences compared have 0% identity, so there is no similarity between them.  The ST2 sample is located in the outermost part because it was identified as S. toromiro in the phylogenomic analysis shown in Figure 3, and rest of the sequences correspond to the concentric rings ordered from outside to inside as: ST-hyb2, ST1, ST-hyb1, SM, S. alopecuroides and A. thaliana. Finally, in the innermost part of the graph is represented the GC content and the location of the genome sequence represented in kilo base pairs (kbps). When performing a visual inspection of the map (Figure 4), the great conservation among the genomes is evident, represented by the black colors in the rings, mostly in the IR regions where are rRNAs and tRNAs genes. This high conservation phenomenon goes hand in hand with an increase in GC levels (rRNA ~54% GC and The ST2 sample is located in the outermost part because it was identified as S. toromiro in the phylogenomic analysis shown in Figure 3, and rest of the sequences correspond to the concentric rings ordered from outside to inside as: ST-hyb2, ST1, ST-hyb1, SM, S. alopecuroides and A. thaliana. Finally, in the innermost part of the graph is represented the GC content and the location of the genome sequence represented in kilo base pairs (kbps). When performing a visual inspection of the map (Figure 4), the great conservation among the genomes is evident, represented by the black colors in the rings, mostly in the IR regions where are rRNAs and tRNAs genes. This high conservation phenomenon goes hand in hand with an increase in GC levels (rRNA~54% GC and tRNA~52% GC vs.~36% GC of the entire chloroplast genome), since it is common for tRNAs to be enriched in high GC palindromic regions that enable the structure to form the characteristic Plants 2021, 10, 342 9 of 16 clover shape. The innermost ring, corresponding to A. thaliana, presents mostly blue and white colors due to the great differences separating this organism from the Sophora genus. Finally, hypervariable regions with identities lower than 82% were identified among the Sophora samples, represented by the white lines throughout all the rings, corresponding to sectors of the ycf1 and rpoC2 genes, and the IGS atpI/atpH, petN/psbM, trnE-UUC/psbD and ycf3/psbI.

Phylogenetic Analysis
Hypervariable regions were selected to construct a bootstrap consensus tree, inferred from 1000 replicates, with S. macrocarpa (SM) as root. Figure 5 shows the phylogenetic reconstruction of the five Sophora samples using the molecular markers ycf1, rpoC2, aptI/atpH, petN/psbM, trnE-UUC/psbD and ycf3/psbI. Nodal support is shown besides divergences. This analysis involved five aligned sequences, one per sample, in which the molecular markers were separately aligned and then concatenated. There were a total of 8084 analyzed nucleotides per sample in the final data set. If chloroplasts are maternally inherited, the ST-hyb2 sample which exhibits phenotypic features of S. toromiro and S. macrocarpa, should present either a S. toromiro chloroplast or a S. macrocarpa chloroplast. The same occurs with the ST-hyb1 sample, which exhibits phenotypic features of S. toromiro and S. microphylla, and should present either a S. toromiro chloroplast or a S. microphylla chloroplast. The results presented in Figure 5 and Figure S1 suggest that four samples present variants of S. toromiro chloroplast. the characteristic clover shape. The innermost ring, corresponding to A. thaliana, presents mostly blue and white colors due to the great differences separating this organism from the Sophora genus. Finally, hypervariable regions with identities lower than 82% were identified among the Sophora samples, represented by the white lines throughout all the rings, corresponding to sectors of the ycf1 and rpoC2 genes, and the IGS atpI/atpH, petN/psbM, trnE-UUC/psbD and ycf3/psbI.

Phylogenetic Analysis
Hypervariable regions were selected to construct a bootstrap consensus tree, inferred from 1000 replicates, with S. macrocarpa (SM) as root. Figure 5 shows the phylogenetic reconstruction of the five Sophora samples using the molecular markers ycf1, rpoC2, aptI/atpH, petN/psbM, trnE-UUC/psbD and ycf3/psbI. Nodal support is shown besides divergences. This analysis involved five aligned sequences, one per sample, in which the molecular markers were separately aligned and then concatenated. There were a total of 8084 analyzed nucleotides per sample in the final data set. If chloroplasts are maternally inherited, the ST-hyb2 sample which exhibits phenotypic features of S. toromiro and S. macrocarpa, should present either a S. toromiro chloroplast or a S. macrocarpa chloroplast. The same occurs with the ST-hyb1 sample, which exhibits phenotypic features of S. toromiro and S. microphylla, and should present either a S. toromiro chloroplast or a S. microphylla chloroplast. The results presented in Figure 5 and Figure S1 suggest that four samples present variants of S. toromiro chloroplast.

Simple Sequence Repeats (SSR) Analysis
The majority of SSRs are in the intergenic spacers (69.4%) and the large single copy region (72.1%) shown as IGS and LSC, respectively, in Table 4. Mononucleotide SSRs correspond to 76.4% of all the reported tandem repeats. Every sample presented the same compound SSR in the rpoC2/rps2 spacer. We identified 4 unique SSR of ST2 which could be used for future biological validations, corresponding to (TATAT)3 and (AAT)4 in the matK/rbcL spacer, (AT)6 in the atpA/trnR-UCU spacer and (ATTT)3 in the rps16 intron. For further information, consult Supplementary Table S1.

Single Nucleotide Polymorphism (SNP) Analysis
The whole chloroplast sequences of the 5 samples were aligned for SNP extraction with the aim to find more ways to help in the identification of pure lines of toromiro. A total of 727 single nucleotide polymorphisms were extracted from the multiple sequence alignment (MSA) file using this approach, spread along the whole chloroplast. To filter this data two criteria were used: No SNP with a tri-allelic state and no indel product of gaps in the MSA were allowed. After the filtering, 577 SNPs were kept for the following analyses.
Using the annotated sequences from ST2 as the reference and with a "Bacterial and Plant Plastid" genetic code in SnpEff, the effect and coding sequences variants for all the 577 SNPs were annotated. The distribution of these 577 markers can be seen in the first column of the Figure S2, which covers almost the entire chloroplast. Based on the quadripartite structure of the chloroplast, the IR regions show the less density of effect which in turns shows the fewer SNPs detected by our MSA approach in these regions. A transition/transversion rate (Ts/TV) of 0.524 was found between our samples ( Figure S3), and most of the SNPs detected have an effect in the upstream and downstream regions, with 84% of all the detected effects falling into these two using the default setting from SnpEff of 5 kbps of size. Only 4.3% of the effect falls into exonic regions with most of them causing missense mutations ( Figure S3). Of all the annotated genes from ST2, 85 have variants generated from the 577 SNPs. The fact that not all the coding genes have variants shows the high conservation of their sequences, so no SNPs were detected for them using our MSA approach. However, this approach shows the full variants generated from all the SNPs and not samples specific or unique SNPs. To do a more specific approach, the SNPs were separated in the 5 samples, filtering them so only SNPs unique to one specific sample were retained. Supplementary Table S2 shows the resulting distribution of unique SNPs after this new filtering. Almost all the unique SNPs are found in the SM sample, with 303 SNPs assigned to this sample after filtering. This was expected, as this sample corresponds to a different species (S. macrocarpa). Among the shared SNP, the sample that has the most SNP in common with SM was ST-hyb2, which is a hybrid between S. toromiro × S. macrocarpa with 29 shared SNPs. Giving more weight to the phylogenetic analysis from Figure 5, ST1 and ST-hyb1 shared 18 SNPs, more than the 9 SNPs shared between ST1 and ST2, two supposedly pure lines (Table 5). Using this new subset of SNPs, a new variant analysis was made by comparing the unique markers of ST2, which was identified by our phylogenetic analysis as a pure toromiro, along with ST1 and the hybrid lines ST-hyb1 and ST-hyb2. To help the visualization, regions of 1 kbps without SNPs in any of the four compared samples were left out ( Figure 6).
When comparing genes with variants between the 5 samples, the sheer density of markers from the SM sample leaves no room for any of the other 4 samples to show any unique variant genes ( Figure S4, upper left diagram). With the aim of finding specific genes that have variants with the specific SNPs in ST2, a comparison of all genes between ST1, ST2 and both hybrids was made. This approach showed 2 genes that have unique variants in ST2, rps7 and ndhB ( Figure S4, upper right diagram). When comparing only ST2 and the hybrids, another gene was found, rpoC1 ( Figure S4, bottom diagram). Both, rps7, a ribosomal related gene and ndhB, a NADPH-related gene, have duplication in the IR regions with rps7 in the IRa and ndhB in the IRb. At the same time ndhB was detected to have an intron along with rpoC1, a transcription related gene. These three genes along with those with SSR variability could offer a robust marker for S. toromiro chloroplast identification. 2021, 10, x FOR PEER REVIEW Figure 6. Heatmap of SNP distribution and impact caused in the genes. ST2 ann used as reference for calculating gene variants. Each row represents a window o of variants caused by SNPs are represented with a red coloration to the right.
When comparing genes with variants between the 5 samples, the

Genomic DNA Extraction, Quantification and Quality Assessment
Genomic DNA extraction was carried out with GeneAll ® Exgene™ Plant SV mini kit using 100 mg of leaflets treated with liquid nitrogen. Quantification was performed with Quant-iT™ PicoGreen ® and the purity of the nucleic acid was evaluated by NanoQuant Infinite ® 200. The DNA quality was determined by agarose (1%) gel electrophoresis.

Single Nucleotide Polymorphism (SNP) Search and Characterization
The search for SNPs was carried out using a multiple sequence alignment (MSA) approach at using the whole sequences of the 5 Sophora samples using the ClustalW option and default parameters implemented in MEGACC. SNPs were extracted from the MSA file using SNP-sites [61] to generate a VCF file with the full collection of the markers. After filtering and keeping only bi-allelic state markers, the SNPs were annotated using SnpEff [62] using the chloroplast sequence from ST2 as the reference to see the distribution and impact on the genes predicted by GeSeq. A Venn diagram was compiled using the online tool from the Bioinformatics Evolutionary Genomics (www.bioinformatics.psb. ugent.be/webtools/Venn/ (accessed on 15 January 2021)) to select genes uniquely affected by specific SNPs from any of the 5 Sophora sequences. Graphics were compiled using GraphPad Prism 7 and adjusted using Illustrator.

Conclusions
In this study, we show for the first time five complete chloroplast genomes of Sophora plants, phenotypically identified as S. toromiro (ST1 and ST2), S. toromiro × S. microphylla (ST-hyb1), S. toromiro × S. macrocarpa (ST-hyb2) and S. macrocarpa (SM). The chloroplast genomes of all the samples were characterized, which adds valuable information to the current knowledge of the Sophora genus, with special focus on toromiro. We determined that 4 of our samples (ST1, ST2, ST-hyb1 and ST-hyb2) presented a unique S. toromiro molecular marker, reported by Shepherd and Heenan, in the psbA/trnH spacer. SSR were characterized and some unique loci of ST2 were selected to be used in a future biological validation. Using a second approach based on SNPs, 3 genes were found to have variants based on SnpEff analysis which could be used together with the loci detected by SSR analysis to improve the selection and the validation of Sophora species. Based on phylogenomics, phenotypic characterization and historical background, we suggest that the ST2 sample could correspond to a genuine S. toromiro (non-hybrid), whilst ST1 corresponds to a hybrid line that was phenotypically misclassified. ST-hyb 2, in turn, raises a question about the genetic inheritance of the chloroplasts because its genome presents similarities to both ST2 and SM samples. While ST2 and ST-hyb2 showed the greatest similarities between all the samples in the CCT map, overall, the samples that shared more SNPs correspond to ST-hyb2 and SM. These results suggest that paternal inheritance may be playing a role in the genetic variability of S. toromiro chloroplasts. A deeper analysis with a higher number of samples from the National Botanic Garden of Viña del Mar, and the development of nuclear markers is required to give a more accurate picture of the hybridization within the genus, in order to finally achieve a specific barcoding for this species. As a future prospect, the data presented here could help the inhabitants of Rapa Nui to reclaim their legacy with a more precise selection of species.