Developing Chloroplast Genomic Resources from 25 Avena Species for the Characterization of Oat Wild Relative Germplasm

Chloroplast (cp) genomics will play an important role in the characterization of crop wild relative germplasm conserved in worldwide gene banks, thanks to the advances in genome sequencing. We applied a multiplexed shotgun sequencing procedure to sequence the cp genomes of 25 Avena species with variable ploidy levels. Bioinformatics analysis of the acquired sequences generated 25 de novo genome assemblies ranging from 135,557 to 136,006 bp. The gene annotations revealed 130 genes and their duplications, along with four to six pseudogenes, for each genome. Little differences in genome structure and gene arrangement were observed across the 25 species. Polymorphism analyses identified 1313 polymorphic sites and revealed an average of 277 microsatellites per genome. Greater nucleotide diversity was observed in the short single-copy region. Genome-wide scanning of selection signals suggested that six cp genes were under positive selection on some amino acids. These research outputs allow for a better understanding of oat cp genomes and evolution, and they form an essential set of cp genomic resources for the studies of oat evolutionary biology and for oat wild relative germplasm characterization.


Introduction
Chloroplast (cp) genomics will play an important role in the characterization of crop wild relative (CWR) germplasm conserved in worldwide gene banks. Currently, many CWR collections are expanding to mitigate the threats of losing CWRs from climate change and other habitat disturbances and to conserve germplasm for plant breeding (e.g., see [1,2]). Thus, the need for CWR germplasm characterization is increasing. Most CWR germplasm has complex, polyploid, and/or uncharacterized genomes [3,4], and current tools based on nuclear genome sequences may not always be effective in identifying CWR germplasm and investigating its genetic variability [5,6]. Conserved CWRs are often misclassified or require taxonomic evaluation [7]. More informative barcoding [8] will be needed to distinguish among CWR accessions for specific traits or features. Acquired CWRs will be used to enhance the studies of plant biology and evolution (e.g., see [6,9]). The search for cp genes in conserved CWR will be increased for applications in genetic engineering [10]. Therefore, it is important to develop cp genomic resources for characterizing CWR germplasm.
Oat (Avena L.) is one of the most widely cultivated cereals and a valuable resource in many countries, both for human consumption and animal feed [11]. The genus Avena has up to 30 recognized species, including diploids, tetraploids, and hexaploids [7,12]. The cultivated hexaploid oat has 42 it caused a different translation of peptide tails. Two genes infA and rps16, coding for a translation initiation factor 1 and a S16 ribosomal protein, respectively, were detected in these oat species.

Comparison of Genomic Structures
Analyzing mVISTA percent identity plot revealed several major features of genomic variation, as illustrated in Figure 2 for nine oat species (and Supplementary File S4 for 25 oat species). First, no marked differences in genomic structure and gene arrangement were observed. Second, the degree of similarity between any two of 25 cp genome sequences ranged from 98.380% to 99.996%, with an average of 99.529%. Third, most of the nucleotide variations across the 25 cp genomes were located in intergenic regions. Fourth, sequence variation among the 25 genomes was also identified for the ndhH gene. Fifth, there were no specific variations in genomic structure and gene arrangement unique to each ploidy level.   motifs were mainly poly-A, with eight to 18 repeats; poly-C, with eight to 14 repeats; poly-AT, with 125 five to seven repeats; and poly-AG with five repeats (Table 3). Six abundant SSR motifs were poly-A, 126 with 8, 9, 10, and 11 repeats, followed by poly-C, with eight repeats and poly-AT, with five repeats. 127 However, no SSR motifs for tri-, tetra, pentra-, and hexa-nucleotides were found. 128

Sequence Variation and Divergence
The SNP calls based on the full length of 25 cp genome sequence alignments with the most stringent conditions revealed a total of 1313 SNPs for these 25 species (see Supplementary File S5). There were 583 SNPs (44.4%) located in the genic regions, 16 SNPs (1.2%) in the pseudogene, and 714 SNPs (54.4%) distributed in the intergenic regions.
The simple sequence repeat (SSR) analysis revealed considerable SSR polymorphism in these 25 oat cp genomes. A total of 6694 SSRs were identified for the 25 oat species. The SSR counts per species ranged from 256 (A. clauda and A. eriantha) to 280 (A. atlantica), with an average of 276.8. The SSR motifs were mainly poly-A, with eight to 18 repeats; poly-C, with eight to 14 repeats; poly-AT, with five to seven repeats; and poly-AG with five repeats (Table 3). Six abundant SSR motifs were poly-A, with 8, 9, 10, and 11 repeats, followed by poly-C, with eight repeats and poly-AT, with five repeats. However, no SSR motifs for tri-, tetra, pentra-, and hexa-nucleotides were found.  (2), tRNA-Arg (3), tRNA-Asn (2), tRNA-Asp, tRNA-Cys, tRNA-Gln, tRNA-Glu, tRNA-Gly(2), tRNA-His (2) Table 3. Simple sequence repeat (SSR) polymorphism found in the plastids of 25 Avena species. The sliding window analysis of nucleotide diversity showed that the genomic region with the highest nucleotide diversity was SSC, followed by LSC, while two repeat regions (IRa and IRb) had the lowest nucleotide diversity (Figure 3). Three specific genome positions with the highest diversity were the sliding windows near 108,349, 59,494, and 81,000 ( Figure 3). ago from A-genome species [18]. The indel variations for the ndhH gene seemed to be associated with 177 the divergence of the As-and AB-genomes, but not with the divergence of hexaploid oat lineage. The 178

SSR Type A A A A A A A A A A
cp genomic variations appeared to be not associated with the nuclear genome polyploidizations. 179  188

Selective Pressure Analysis
The positive selection analysis considered all 130 genes, with a total of 19,941 amino acids. The likelihood ratio tests for three models (M2a vs. M1a; M8 vs. M7; and M8 vs. M8a) appeared to suggest the presence of positive selections with p-values < 0.05 on many cp genes (Table 4)

Discussion
Our cp genomic analysis here generated the first set of cp genome assemblies for these 25 oat species. The genomes typically had four regions (LSC, SSC, IRa, and IRb), with lengths of roughly 136,006 bp, and they carried 130 genes and four to six pseudogenes. Purifying selection was the dominant force acting on the cp genes. The genomes harbored 1313 SNPs and 277 SSRs per species. More nucleotide diversity was located in the SSR and LSC, rather than IRa and IRb, regions. These research outputs allowed for a better understanding of oat cp genomes and evolution, and they formed an essential set of cp genomic resources for future oat studies and assessing oat genetic resources.
The analysis also revealed many comparative characteristics and some unique features of these oat cp genomes. First, the two genes infA and rps16, coding for a translation initiation factor 1 and a S16 ribosomal protein, respectively, were detected in these oat species, like other grass species, such as wheat and barley, but they were reported to be absent or nonfunctional in Malpighiales: Passiflora edulis, Jatropha curcas, and Manihot esculenta [29][30][31]. Second, a large sequence variation among 25 cp genomes was identified for the ndhH gene. Third, little differences in genomic structure and gene arrangement were identified across 25 species, and no marked genomic variations were unique to an oat ploidy level. Fourth, the positive selection was relatively weak acting on the 130 cp genes.
These comparative genomic features provided some empirical support for the previous inferences of oat evolution [9,18], as these genomes had the same gene count and arrangement, and purifying selection was the dominant force of selection acted on most of the 130 genes. Two C-genome species (A. clauda and A. eriantha) had only four pseudogenes, while the other 23 species had six. Such variation might be related to the major divergence of C-genome species 13-15 million years ago from A-genome species [18]. The indel variations for the ndhH gene seemed to be associated with the divergence of the As-and AB-genomes, but not with the divergence of hexaploid oat lineage. The cp genomic variations appeared to be not associated with the nuclear genome polyploidizations.
Managing more than 13,000 accessions of oat wild relatives is a challenging task, as many difficult issues must be properly addressed, such as taxonomic delimitation, duplication identification, viability monitoring, and field regeneration [17,32]. The cp genomic resources developed here can be utilized to develop taxonomic barcodes [8] for germplasm identification, e.g., from the region between the psbZ and trnfM (CAU) genes (see Supplementary File S4). Specific barcoding to identify wild relative germplasm of specific interests such as ecological types and geographic origin can also be developed. Molecular characterization of wild relative germplasm using cp markers, such as SNPs or SSRs, may be more informative than those using nuclear genomic tools, as cp markers are more conservative and nuclear genomic markers developed from polyploidy plants may not necessarily be informative. The inferred maternal phylogeny from these cp genomic resources (e.g., see [18]) can provide some guide for the search of evolutionarily related, economically important cp genes for gene introgression into oat breeding programs [15,16] through cp genome engineering [10]. The cp genome sequences are also essential for selecting intergenic spacer regions in oat cp genomes for transgene integration and assessing cp genome regulatory sequences in transgene expression [10]. We have initiated research to utilize these developed genomic resources to enhance the conservation, management and utilization of the oat collection in Plant Gene Resources of Canada.

Plant Material
We selected 25 Avena accessions of known species identity from the Plant Gene Resources of Canada (PGRC) oat collection, based on our previous oat research [3]. The selected accessions originated from various regions around the world and represent 25 species of the six botanical sections of the Avena genus, Ventricosa, Agraria, Tenuicarpa, Pachycarpa, Ethiopica, and Avena, and five distinct nuclear genomes organized in diploid (A or C), tetraploid (AB or AC), and hexaploid species (ACD). Table 1 shows the detailed information on the selected accessions with PGRC accession numbers, including botanical section and ploidy. About 300 seeds from each accession were planted in August 2013 in a 15 cm pot, grown for 8 to 10 days in the greenhouse at Saskatoon Research and Development Centre of Agriculture and Agri-Food Canada, and then incubated in the dark for 48 to 72 h. Up to 15 g from all 300 seedling leaves were collected and washed in cold water. Leaves were cut into 1 cm pieces with scissors, snap frozen with liquid nitrogen in a -20 • C mortar, and then ground to a fine

Chloroplast Genome Assembling and Annotation
All raw sequence reads were cleaned first with cutadapt [35] to remove sequence adapters and to perform quality trimming. Partial Nextera adapter sequence 'AGATGTGTATAAGAGACAG' was used to trim the raw sequence reads. All the sequence reads with both quality lower than 15 and shorter than 150 bp were discarded. SPAdes v3.11.1 [36] was used as the assembler for the circular cp genome assembly in the pair-end mode. Preliminary tests were performed to reach the least contigs number and the longest scaffold size by a series of combinations of different coverages and k-mer sizes. The k-mer size was eventually set to 127, and the coverage was set to 1000 folds, after a series of training analyses. The four major gaps located at the four junctions (LSC-IR, IR-SSC, SSC-IR, and IR-LSC) were filled in by the assistance of the four junction sequences. These junction sequences were obtained from the alignments of the scaffolds with their closely related species, including wheat (Triticum aestivum, NCBI Reference Sequence: AB042240.3; [37]), bent grass (Agrostis stolonifera; NCBI Reference Sequence: NC_008591.1), and ryegrass (Lolium perenne; NCBI Reference Sequence: NC_009950.1) cp genome. Each of the four junction sequences (ranging between 540 and 700 bp) containing both IR and another (either LSC or SSC) structure fragment was used as a bait to screen for reads for further gap sequence recovery. The selected reads from BLAST were also used to link adjacent structure fragments. The additional gaps located within the scaffolds of several Avena accessions were similarly filled with the assistance of the bait sequences acquired from either wheat or other Avena cp genomes with sequences at the same locations.
Gene annotations of 25 cp genomes were made using online DOGMA program [38], along with the cp genome annotations of wheat (NCBI Reference Sequence: AB042240), ryegrass (NCBI Reference Sequence: NC_009950), and bent grass (NCBI Reference Sequence: NC_008591.1). Manual curation was also made for the variations within coding genes, such as rRNA and tRNA, based on multiple sequence alignments with their closely related species in the Triticeae tribe. The circular maps of the Avena cp genomes were generated first with GenomeVx [39] and Circos [40] and finally merged in Inkscape (https://inkscape.org).

Comparative Genomic Analysis
To identify the genomic regions with substantial variability, the complete cp genomes of 25 Avena species were compared using mVISTA [41], with wheat cp genome as reference. For this comparison, the percent identity matrix among 25 cp genomes was also generated. To illustrate the genomic variations with respect to ploidy, further effort was also made, using A. eriantha cp genome as reference to compare nine cp genomes, representing diploid, tetraploid, and hexaploid species.

SNP, SSR, and Diversity Analysis
The SNP calling was performed on the basis of multiple sequence alignments (MSA) by SNP-sites with the default options [42]. MAFFT was used to generate MSA data with the FFT-NS-i×1000 alignment algorithm [43]. To identify SSRs, 25 cp genomes were analyzed using MISA [44], with the following setting of minimum numbers of repeats to 8, 5, 4, 3, 3, and 3 for mono-, di-, tri-, tetra, pentra-, and hexa-nucleotides, respectively. The sliding window diversity analysis was done using DnaSP v6 [45] to estimate nucleotide diversity across 25 oat species, with a sliding window of 2000 bp and step size of 200 bp.

Selective Pressure Analysis
We applied several site models (M0, M1, M2, M3, M7, and M8), implemented in codeml of PAML v 4.9i [46], to estimate the Ka/Ks and ω values, considering F3X4 codon frequencies. MAFFT, with the default options, was used to align the nucleotide sequences of all cp genes, and the phylogenetic tree of 25 Avena species, without tree-branch lengths, was obtained from the previous phylogenetic analysis [18]. Four nested site models (M3 vs. M0; M2 vs. M1; M8 vs. M7; and M8a vs. M8) were evaluated by log-likelihood ratio tests (LRT). The positively selected sites were analyzed by Naïve Empirical Bayes (NEB) analysis and Bayesian Empirical Bayes (BEB) analysis. Extra effort was also made to perform a maximum likelihood analysis of natural selection codon-by-codon, using MEGA 7 [47], following the method of Suzuki and Gojobori [48].