Genome Survey Sequencing of Acer truncatum Bunge to Identify Genomic Information, Simple Sequence Repeat (SSR) Markers and Complete Chloroplast Genome

Acer truncatum Bunge is a forest tree species found in northern China. Following the recent discovery that its seeds contain a considerable amount of nervonic acid, this species has received increasing attention. However, no genome has been reported for this species. In this study, we report the Acer truncatum genome sequence produced by genome survey sequencing. In total, we obtained 61.90 Gbp of high-quality data, representing approximately 116x coverage of the Acer truncatum genome. The genomic characteristics of Acer truncatum include a genome size of 529.88 Mbp, a heterozygosis rate of 1.06% and a repeat rate of 48.8%. A total of 392,961 high-quality genomic SSR markers were developed and a graphical map of the annotated circular chloroplast genome was generated. Thus far, this is the first report of de novo whole genome sequencing and assembly of Acer truncatum. We believe that this genome sequence dataset may provide a new resource for future genomic analysis and molecular breeding studies of Acer truncatum.


Introduction
Acer truncatum Bunge is a forest tree species found in northern China [1,2]. The seeds contain a considerable amount of nervonic acid (24:1; cis-tetracos-15-enoic acid) and the seed oil has been officially classified as an edible oil by the Ministry of Health of China [3]. Only a few plant species have high amounts of nervonic acid in their seed oil. Nervonic acid is widely present in the sphingolipids of the nervous system of vertebrates and is a core component of brain fibers and nerve cells [4]. Nervonic acid oils have become important targets for pharmaceutical and nutraceutical applications that aim to treat a number of neurological disorders [5]. This tree species is also important in various disciplines, including botany, geography and climatology. Thus far, research into the utilization of Acer truncatum has focused on leaf extracts and there is a distinct lack of genomic information [6][7][8][9][10].
Recently, genome survey sequencing by next generation sequencing (NGS) has emerged as an important and cost-effective strategy for generating a wide range of genetic and genomic information for different plant species [11][12][13][14][15][16][17][18]. It can also identify a large number of molecular markers for breeding [19]. The SSR marker is the most widely used molecular marker system, and SSR markers have been developed for many species through NGS [11,13,15,16,18]. An increased molecular marker density will enable better genome-wide association and molecular breeding studies [20,21]. Chloroplasts are plant-specific organelles that play an important role in photosynthesis and carbon fixation in green plants [22]. The development of NGS technologies has also allowed for the sequencing of entire chloroplast genomes. Many chloroplast genomes of different Sapindales species have already been determined by NGS [23][24][25][26].
In the present study, the Acer truncatum genome sequence produced by genome survey sequencing is reported. The genome was assembled and then used to develop SSRs. After data filtering, we obtained 61.90 Gbp of high-quality data, representing approximately 116x coverage of the Acer truncatum genome. The genomic characteristics of Acer truncatum include a genome size of 529.88 Mbp, a heterozygosis rate of 1.06% and a repeat rate of 48.8%. Furthermore, 392,961 high-quality genomic SSR markers were developed. A graphical map of the annotated circular chloroplast genome was also generated. Thus far, this is the first report of de novo whole genome sequencing and assembly of Acer truncatum. In short, we believe that this genome sequence dataset may provide a new resource for future functional genomic and evolutionary analysis studies of Acer truncatum as well as for studies focusing on its molecular breeding.

Plant Materials
Fresh leaves from a single Acer truncatum tree (approximately 60 years old) were collected from the testing grounds of Northwest A&F University in Yangling, Shaanxi Province, China (Figure 1). The genomic DNA was isolated from the fresh leaves with the DNeasy Plant Mini Kit (Qiagen, Valencia, CA, USA).

Illumina Sequencing Data Analysis and Assembly
Whole-genome sequencing of Acer truncatum was performed on the Illumina HiSeq platform (Illumina Inc., San Diego, CA, USA). Clean data were obtained by performing rigorous quality assessments and conducting data filtering on the Illumina sequencing raw data. The clean data with high-quality reads were assembled using the de Bruijn graph-based SOAPdenovo software (version 1.05, BGI, Beijing, China) [27]. High-quality Illumina sequencing reads were submitted to the NCBI Short Read Archive (accession number: SUB4843212).

Genome Size Estimation, GC Content and Genome Survey
The clean data with high-quality reads were used for k-mer analysis. Based on k-mer (k = 21) frequency distributions, we used GenomeScope to estimate the characteristics of the genome (genome size, repeat content and heterozygosity rate) [28]. Non-overlapping 10-kb sliding windows along the assembled sequence were used to calculate GC content and sequencing depth.
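The windowed GC calculation described above can be illustrated with a minimal Python sketch. It is not the pipeline used in this study; the function name `gc_content_windows` and the choice to ignore ambiguous bases (such as N) are assumptions made for illustration only.

```python
def gc_content_windows(sequence, window=10_000):
    """Return (start, gc_fraction) for non-overlapping windows along a sequence.

    Assumption: only A, C, G and T are counted; windows containing no
    unambiguous bases are skipped.
    """
    seq = sequence.upper()
    results = []
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        counted = sum(chunk.count(base) for base in "ACGT")
        if counted == 0:
            continue  # nothing but ambiguous bases in this window
        gc = chunk.count("G") + chunk.count("C")
        results.append((start, gc / counted))
    return results
```

Plotting each window's GC fraction against its mean read depth gives a scatterplot of the kind described for Figure S2.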

Assembly and Analysis of the Chloroplast Genome
The chloroplast genome was assembled using NOVOPlasty software version 2.6.7 [31]. The assembled chloroplast genome was annotated with GeSeq [32] and by comparison with the chloroplast genomes of Acer davidii (NC_030331) [23], Acer morrisonense (NC_029371) [24], Acer griseum [25] and Acer miaotaiense (NC_030343) [26]. A graphical map of the annotated circular chloroplast genome was generated using the OGDRAW program [33]. The phylogenetic tree was reconstructed with the Maximum Likelihood (ML) algorithm using MEGA6 software [34]. The phylogenetic analysis was based on the nucleotide sequences of the coding genes. The number of bootstrap replications was 500 for the phylogeny test. The high-quality annotated complete chloroplast genomic sequence was deposited in GenBank with the accession number MH716034.

Genome Sequencing and Sequence Assembly
To obtain a sufficient amount of DNA for sequencing libraries, we isolated DNA from the leaves of Acer truncatum (Figure 1). More than 61.90 Gbp of high-quality data with clean reads (Q30 94.8%) was generated from the sequencing library (300 bp) on the Illumina HiSeq sequencing platform. This corresponds to approximately 116x coverage of the estimated Acer truncatum genome size (Table S1). All the high-quality and clean data were used for genome sequence assembly analysis. The high-quality and clean data represented over 30x coverage, which indicates successful genome survey sequencing [27]. The high-quality and clean data were de novo assembled (K-mer = 75) using the de Bruijn graph-based SOAPdenovo software [35]. The Acer truncatum genome assembly consisted of 2,412,582 scaffolds with a total length of 866,062,477 bp and a scaffold N50 length of 735 bp (Table 1). After filtering out scaffolds shorter than 1000 bp, Genscan and Augustus were used to predict genes with parameters trained on Acer truncatum. Genes were annotated based on the following nine databases: NR, NT, Swiss-Prot, KEGG, KOG, Pfam, GO, COG and TrEMBL. The functional annotations for the genes are shown in Table S2. For example, we used BLASTX to perform a similarity analysis and comparison against the NR database. Among a wide range of plants, the assembled Acer truncatum genomic sequence had the highest number of hits against Citrus sinensis (Figure S1). This result is consistent with previous transcriptome sequencing findings [29]. The N50 of scaffolds and contigs was calculated by ordering all sequences from longest to shortest and then adding their lengths until the cumulative length exceeded 50% of the total length of all sequences; the length of the last sequence added is the N50. N90 is similarly defined.
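The N50 and N90 definitions above can be written as a short Python sketch. The function name `n_statistic` is hypothetical, and the sketch treats the threshold as inclusive (cumulative length reaching, not strictly exceeding, the given fraction), which is the common convention and is an assumption here rather than something stated in the text.

```python
def n_statistic(lengths, fraction=0.5):
    """Return the Nx value (N50 for fraction=0.5, N90 for fraction=0.9).

    Sequences are ordered from longest to shortest and their lengths are
    accumulated; the length of the sequence at which the running total
    reaches the given fraction of the total length is returned.
    """
    ordered = sorted(lengths, reverse=True)
    threshold = sum(ordered) * fraction
    running = 0
    for length in ordered:
        running += length
        if running >= threshold:  # inclusive threshold (assumption)
            return length
    raise ValueError("no sequence lengths given")
```

For a toy set of scaffold lengths `[10, 8, 6, 4, 2]` (total 30), the running total reaches 15 at the second sequence, so `n_statistic(...)` returns 8 for N50, while `n_statistic(..., fraction=0.9)` returns 4 for N90.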

Genomic Characteristics
For the 21-mer frequency distribution, the number of k-mers was 47,661,885,129 and the peak of the depth distribution was at 75x. The estimated genome size of Acer truncatum was 529.88 Mbp (Figure 2). The peak at half the depth of the main peak (~38x) reflects heterozygosity, and the heterozygosis rate of this genome was estimated to be approximately 1.06%. This is a relatively high heterozygosis rate. The peaks at integer multiples of the main peak depth reflect repetitive sequences, which account for about 48.8% of the Acer truncatum genome (Figure 2). A scatterplot of the genomic GC content and sequencing depth can provide information on sequencing data bias (Figure S2). The GC content of the Acer truncatum genome was 35.04%, which is a mid-range GC content. As shown in the GC content analysis in Figure S2, there is a second dense area (red area), which may be caused by the high rate of heterozygosity (1.06%).
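The figures above can be cross-checked with simple arithmetic. The sketch below is illustrative only: it recomputes the sequencing depth from the reported data volume and genome size, and it applies the textbook single-peak estimate (total k-mers divided by peak depth). The function and constant names are assumptions introduced for this example.

```python
# Values reported in the text.
TOTAL_BASES = 61.90e9          # high-quality data, in bp
GENOME_SIZE = 529.88e6         # GenomeScope estimate, in bp
KMER_COUNT = 47_661_885_129    # total number of 21-mers
PEAK_DEPTH = 75                # depth of the main k-mer peak

def sequencing_depth(total_bases, genome_size):
    """Average coverage = sequenced bases / genome size."""
    return total_bases / genome_size

def naive_genome_size(kmer_count, peak_depth):
    """Single-peak estimate: total k-mers / depth of the main peak."""
    return kmer_count / peak_depth

depth = sequencing_depth(TOTAL_BASES, GENOME_SIZE)
naive = naive_genome_size(KMER_COUNT, PEAK_DEPTH)
print(f"depth: {depth:.1f}x, naive size: {naive / 1e6:.1f} Mbp")
```

The depth comes out at about 116.8x, in line with the reported coverage. The single-peak estimate (about 635.5 Mbp) is only a rough cross-check: the 529.88 Mbp value comes from GenomeScope, which fits a model to the full k-mer profile rather than dividing by a single peak depth, so the two numbers need not coincide.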

Genomic SSR Marker Development
The assembled scaffolds were searched for the presence of SSR markers using the MISA software (http://pgrc.ipk-gatersleben.de/misa/misa.html). A total of 392,961 putative SSR markers from 145,640 scaffolds were identified (Table S3). Among the identified SSR markers, the Di-nucleotide was the most abundant type, accounting for 69.25% of the total SSR markers, followed by Tri- (21.36%), Tetra- (6.56%), Penta- (1.71%) and Hexa- (1.21%) nucleotide SSR markers (Figure 3A). Di-nucleotide and Tri-nucleotide SSR markers together made up the large majority, while the remaining types amounted to less than 10%. Among the Di-nucleotide SSR markers, the AT/AT repeat motifs accounted for 71.31%, AG/CT accounted for 20.01%, AC/GT accounted for 8.65% and CG/CG only accounted for 0.03% (Figure 3B). Among the Tri-nucleotide SSR markers, the predominant AAT/ATT, AAG/CTT and ATC/ATG repeat motifs accounted for 54.72%, 22.10% and 6.61%, respectively (Figure 3C).
The SSR markers categorized by the number of repeat motifs were summarized (Figure 4). The Di- and Tri-nucleotide SSR markers were far more prevalent than the other SSR markers. The number of SSR markers decreased as the repeat motif length increased.
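To make the idea of an SSR search concrete, the following is a minimal regex-based sketch in Python. It is not MISA and does not reproduce its output; the function name `find_ssrs` and the minimum repeat numbers in `MIN_REPEATS` are assumptions chosen for illustration, since the thresholds used in this study are not stated in the text.

```python
import re

# Hypothetical thresholds: motif length -> minimum number of repeats.
MIN_REPEATS = {2: 6, 3: 5, 4: 5, 5: 5, 6: 5}

def find_ssrs(sequence, min_repeats=MIN_REPEATS):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs."""
    seq = sequence.upper()
    hits = []
    for unit_len, min_rep in sorted(min_repeats.items()):
        # A motif of unit_len bases followed by at least min_rep - 1 copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_rep - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            # Skip motifs that are repeats of a shorter unit (e.g. "ATAT", "AA").
            if any(
                unit_len % k == 0 and motif == motif[:k] * (unit_len // k)
                for k in range(1, unit_len)
            ):
                continue
            hits.append((match.start(), motif, len(match.group(0)) // unit_len))
    return sorted(hits)
```

For example, `find_ssrs("GG" + "AT" * 7 + "CC")` reports a single Di-nucleotide SSR, `(2, "AT", 7)`. Grouping the reported motifs by length would give a type distribution of the kind summarized in Figure 3A.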

Assembly of Chloroplast Genome
The complete annotated chloroplast genome of Acer truncatum was a double-stranded circular DNA with a length of 156,241 bp. The structure of the chloroplast genome is similar to that of other plant species, including two inverted repeat (IRA and IRB) regions (52,172 bp), a large single-copy (LSC) region (85,972 bp) and a small single-copy (SSC) region (18,097 bp) (Figure 5). In the Acer truncatum chloroplast genome, 132 functional genes were predicted, including 87 protein-coding genes, 37 tRNA genes and 8 rRNA genes. Most of these genes appear in a single copy while 17 gene types appear as double copies or more, including six protein-coding gene species (rps7, rps12, rpl2, rpl23, ycf2 and ndhB), seven tRNA gene species (trnA-UGC, trnE-UUC, trnL-CAA, trnM-CAU, trnN-GUU, trnR-ACG and trnV-GAC) and four rRNA gene species (rrn23, rrn16, rrn5 and rrn4.5). The overall GC content in the chloroplast genome was approximately 37.9%. Contractions and expansions of the IR regions at the borders are common evolutionary events and represent the main reasons for the size variation among chloroplast genomes. A detailed comparison of the four junctions (LSC/IRA, LSC/IRB, SSC/IRA and SSC/IRB) between the two IRs (IRA and IRB) and the two single-copy regions (LSC and SSC) was completed relative to the Acer miaotaiense chloroplast genome (Figure S3). In order to investigate the placement of the Acer truncatum chloroplast genome within the order Sapindales, we performed a phylogenetic analysis based on 75 chloroplast protein-coding genes for 12 species with the Arabidopsis thaliana chloroplast genome as an outgroup (Figure 6). The topology of the phylogenetic tree is basically consistent with the traditional taxonomy of the order Sapindales. The results suggest that Acer truncatum is closely related to the four congeners Acer miaotaiense P. C. Tsoong, Acer davidii Franch, Acer morrisonense Hayata and Acer griseum Pax. These four taxa together with Dipteronia sinensis Oliv. and Dipteronia dyeriana Henry form a monophyletic clade in the family Aceraceae.
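The region lengths and gene counts reported for the chloroplast genome can be checked for internal consistency with a few lines of Python. This is a simple arithmetic sketch; it assumes that the 52,172 bp figure is the combined length of IRA and IRB, which is what the reported total implies. The constant and function names are introduced only for this example.

```python
# Values reported in the text (bp).
TOTAL_LENGTH = 156_241
LSC = 85_972
SSC = 18_097
IR_COMBINED = 52_172  # assumed to be IRA + IRB together

def single_ir_length(ir_combined):
    """Length of one inverted repeat, given the combined IRA + IRB length."""
    return ir_combined // 2

# The quadripartite structure should sum to the full genome length.
regions_sum = LSC + SSC + IR_COMBINED

# Gene counts reported in the text.
gene_total = 87 + 37 + 8          # protein-coding + tRNA + rRNA
multi_copy_types = 6 + 7 + 4      # protein-coding + tRNA + rRNA gene species

print(regions_sum, single_ir_length(IR_COMBINED), gene_total, multi_copy_types)
```

Under that assumption the regions sum to 156,241 bp, each inverted repeat is 26,086 bp, and the gene counts add up to the reported 132 genes and 17 multi-copy gene types.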

Conclusions
In the present study, the Acer truncatum genome sequence produced by genome survey sequencing was reported. The genomic characteristics of Acer truncatum include a genome size of 529.88 Mbp, a heterozygosis rate of 1.06% and a repeat rate of 48.8%. A total of 392,961 high-quality genomic SSR markers were developed and a graphical map of the annotated circular chloroplast genome was generated. These results and the dataset may provide a new resource for future genomic analysis and molecular breeding studies of Acer truncatum.

Supplementary Materials:
The following are available online at http://www.mdpi.com/1999-4907/10/2/87/s1. Figure S1: Species distribution of the top BLAST hits in the NR database, Figure S2: GC content and average sequencing depth of the Acer truncatum genome data, Figure S3: Comparison of the LSC, SSC and IR regions in chloroplast genomes of Acer truncatum and Acer miaotaiense, Table S1: Statistics of Acer truncatum sequencing data, Table S2: Statistics of gene functional annotation, Table S3: The SSR types detected in the Acer truncatum sequences.
Author Contributions: R.W., P.C., L.Z. and M.Z. performed the experiments. L.L. and J.F. analyzed the data, prepared figures and tables, wrote the paper and reviewed drafts of the paper. All authors read and approved the final manuscript.

Figure 2. k-mer distribution as calculated by GenomeScope. Blue bars represent the observed k-mer distribution; black line represents the modelled distribution without the k-mer errors (red line) and up to a maximum k-mer coverage specified in the model (yellow line). length, estimated genome length; uniq, unique portion of the genome (nonrepetitive elements); het, genome heterozygosity; rep, repetitive portion of the genome.

Figure 3. Characteristics of SSR markers. (A) Frequency of different SSR markers; (B) Frequency of different Di-nucleotide SSR markers; (C) Frequency of different Tri-nucleotide SSR markers.

Figure 4. The distribution and frequency of SSR motif repeat numbers.

Figure 5. Physical map of the circular chloroplast genome of Acer truncatum.

Table 1. Information on the assembled genome sequences of Acer truncatum.