A Draft Genome Sequence for Ensete Ventricosum, the Drought-tolerant " Tree against Hunger "

We present a draft genome sequence for enset (Ensete ventricosum) available via the Sequence Read Archive (accession number SRX202265) and GenBank (accession number AMZH01. Enset feeds 15 million people in Ethiopia, but is arguably the least studied African crop. Our sequence data suggest a genome size of approximately 547 megabases, similar to the 523-megabase genome of the closely related banana (Musa acuminata). At least 1.8% of the annotated M. acuminata genes are not conserved in E. ventricosum. Furthermore, enset contains genes not present in banana, including reverse transcriptases and virus-like sequences as well as a homolog of the RPP8-like resistance gene. We hope that availability of genome-wide sequence data will stimulate and accelerate research on this important but neglected crop.


Introduction
Enset (Ensete ventricosum) is one of the most important crop plants grown in Ethiopia, where it makes a major contribution of to the food security of the country, feeding at least 15 million people.It buffers food deficit during dry spells and recurrent drought and has been dubbed as the "tree against hunger" [1].Enset is a multi-purpose crop, with all parts of the plant being utilized for human food, animal forage, medicine, or ornamental uses [2].Furthermore, it has the capacity for high yield, can be stored for long periods, can be harvested at any time of the year and at any stage over a period of several years [3], thereby offering advantages over seasonal crops.
The genus Ensete falls within the botanical family Musaceae, which also includes bananas and plantains (genus Musa).Enset is susceptible to some of the same diseases that threaten banana, including bacterial wilt caused by Xanthomonas campestris pathovar musacearum [4].Unlike banana, the main edible parts of the enset plant are the starchy corm and pseudostem.The genome of enset is diploid with n = 9 [5], while the recently published doubled-haploid banana genome sequence has n = 11 [6].
There are many clones and landraces of enset in Ethiopia [1,3].A collection of more than 600 clones and landraces from major enset growing areas of Ethiopia has been assembled and conserved ex situ by the Southern Agricultural Research Institute at Areka and some of these differ in important agronomic characteristics and tolerance to disease [7].Some attempts at molecular characterization of enset clones or landraces have been made using amplified fragment length polymorphism AFLP [8,9] and random amplified polymorphic DNA RAPD techniques [10,11], revealing the existence of genetic diversity and, therefore, the potential for improvement by breeding, if suitable markers were available.However, despite its importance and value, enset has been relatively neglected by scientific research and is arguably the least-studied African crop.There is an urgent need for efficient improvement of this crop.Our aim was to help accelerate enset research and crop improvement by providing draft genome sequence data and identifying single-nucleotide polymorphisms (SNPs) that might serve as molecular markers for marker-assisted breeding.We also aimed to investigate genetic similarity between enset and banana thus to assess the usefulness of banana genomic resources for application to enset.

Whole-Genome Sequencing
We generated 40.4 gigabases of whole-genome shotgun sequence data from the enset genome consisting of 202 million pairs of 100-nucleotide Illumina sequence reads.The sequence reads are freely available from the Sequence Read Archive under accession number SRX202265.Our approach was similar to that of Davey and colleagues [12] who recently re-sequenced the banana B genome (M.balbisiana) using 281 million pairs of 100-nucleotide Illumina sequence reads.Their attempt at de novo assembly yielded a highly fragmented genome assembly consisting of a large number of short contigs.However, they were able to gain insights into the B genome by aligning their sequence reads against the previously sequenced A genome (M.acuminata) and calling a consensus alignment [12].Likewise, we used both de novo sequence assembly (that is, without using a reference genome sequence) and an approach based upon alignment of reads against the banana A-genome reference sequence as described in the sections below.Our aligned enset genomic sequence reads covered 47% of the M. acuminata reference genome sequence (247 out of 523 Mb).This is less than the coverage by Davey and colleagues' alignment of M. balbisiana reads against the same reference genome, which covered 341 out of 523 Mb (65%), perhaps not surprisingly given the larger evolutionary distance between enset and the Musa species.
To check for contamination, we aligned our enset genomic sequence reads against all of the 2735 available complete prokaryotic genomes [13] using the Burrows-Wheeler Aligner BWA [14].We found that 8.27% of our sequence reads were alignable against prokaryotic bacterial sequences.The genome sequences showing the greatest coverage were Pseudomonas fluorescens SBW25 [15] and Methylobacterium radiotolerans JCM 2831 ( [16], GenBank: CP001001) with sequence reads covering 30.6% and 33.5% of the lengths of their genomes, respectively.These prokaryotic sequences possibly originate from endophytes and/or epiphytes associated with the plant even though we attempted to clean and sterilize the surface of the plant material by wiping with ethanol.We note that in the study by Davey and colleagues [12] there was also some bacterial sequence present in the M. balbisiana genomic re-sequencing data: 3.03% of Davey's data aligned to the prokaryotic genome sequences, with coverage of 94.3% of the Propionibacterium acnes 266 [17] chromosome, and 60.8% of the Serratia marcescens WW4 [18] chromosome.Therefore, it seems that bacterial contamination of plant genome sequence data is not unique to our study.We also note that the depth of coverage of any single bacterial genome by "plant" genomic reads is very low: no more than 2.03× for the P. fluorescens and M. radiotolerans genomes and no more than 9.1× for the P. acnes and S. marcescens genomes mentioned above, and, therefore, not enough to be effectively assembled de novo.

Estimation of the Enset Genome Length
Based on alignment against enset nuclear DNA sequences available in the GenBank database (Table 1), we estimate the depth of coverage as 67.67×.Given that we generated a total of 37.05 gigabases of sequence data (after removing prokaryote-matching reads) this would indicate a genome size of approximately 547 megabases.This is close to the haploid genome size of 523 megabases for the closely related M. acuminata [6].

Conservation of Protein-Coding Sequences between Enset and Banana
To identify which banana protein-coding genes are conserved in enset, we aligned our enset shotgun sequence reads against the 36,542 M. acuminata coding sequences identified by D'Hont and colleagues [6] using BWA [19].The advantage of this approach is that it is not confounded by incomplete assembly of or gene prediction in the enset data.The frequency distribution for breadth of coverage across these 36,542 sequences is shown in Figure 1.The breadths of coverage follow a bi-modal distribution with peaks close to zero and close to 100% coverage.The peak close to zero corresponds to banana genes that are either absent from the enset genome or else they are so divergent that the corresponding enset sequences fail to align.There are 662 (1.8%) banana protein-coding sequences that have zero coverage by the aligned enset data and are, therefore, absent, or very divergent, in enset.The Supplementary Data includes a spreadsheet indicating the breadths of coverage of each M. acuminata gene.

Heterozygosity and Single-Nucleotide Polymorphisms (SNPs)
Single-nucleotide polymorphisms (SNPs) can be valuable markers for crop improvement [20] but have not previously been reported for enset.Given the very fragmented nature of our de novo assembly of the enset genome, we followed the example of Davey and colleagues [12] by performing SNP calling against the high-quality reference genome sequence of M. acuminata [6].To do the alignment, we used BWA [14] and only considered sequence reads that uniquely align to a single genomic location.By aligning the enset shotgun sequence reads against this banana genome sequence, we were able to identify 30,287 sites at which there was an approximately 50:50 ratio between the two most frequent aligned nucleotides (where the most abundant base accounts for between 49% and 51% of the aligned bases and where coverage is at least 10×).These sites are distributed over the whole genome (see Figure 2) and occur on average every 17.3 kb.If we are less stringent and include all sites where the frequency of the most abundant base is between 48% and 52%, then the number of heterozygous sites increases to 76,416, a density of one site per 6.8 kb of banana genome.See Figure 3 for an example of such a locus, containing three heterozygous sites.See the Supplementary Data for a list of these heterozygous sites.The rationale for using the banana genome as a reference sequence for identifying heterozygous SNPs is that the banana reference genome sequence is much more contiguous and better annotated than the enset de novo genome sequence.However, one limitation of this approach is that it will fail to identify heterozygous sites that fall within enset-specific sequences.We found that alignment between enset genomic sequences reads and the banana reference genome sequence covered only 47% of the banana genome and occurred much more frequently in genes rather than intergenic regions, as also observed by Davey and colleagues [12] for alignment of M. balbisiana genomic reads against the same reference genome.To circumvent this limitation, we also generated lists of heterozygous sites called on the enset de novo assembly; these can be found in the Supplementary Data.

De Novo Assembly of the Enset Genome Sequence
Although alignment of raw sequence reads against the banana reference genome sequence is useful for identifying SNPs and sequences conserved between both plant species, we required a de novo assembly of the enset data in order to examine gene order and to identify enset sequences that are not present in the banana genome.Our assembly had a total length of 459.5 megabases.This represents 84% of the estimated enset genome-size of 547 megabases and is 97.3% of the length of the recently published banana genome assembly of 472.2 megabases [6].Given that our estimate of the enset genome size based on sequence coverage is very approximate and assuming that the enset genome is of similar size to the banana genome, then this suggests that our de novo assembly represents nearly complete coverage of the enset genome.Example of a protein-coding gene that is heterozygous in enset.We aligned enset genomic sequence reads against the banana genome using BWA.The figure shows a 40-nucleotide region of the alignment falling within a protein-coding gene (GSMUA_Achr1T20250_001), encoding a predicted acyl-transferase.This region includes three single-nucleotide polymorphisms, at which the enset genome sequence is heterozygous with approximately 50:50 frequencies for two haplotypes (C…C…T and G…T…C).
The enset genome sequence assembly is available via the GenBank database under accession number AMZH01.Due to restrictions on the numbers of contigs and supercontigs that GenBank can accept within a whole-genome shotgun project, GenBank only includes the enset contigs and super-contigs that are at least five kilobases in length.The full assembly, including contigs and super-contigs of between 200 and 5000 nucleotides, is available via Figshare [21].Approximately 70% of the enset genome assembly is alignable against the banana genome sequence and average nucleotide sequence identity is 89.90% over the alignable sequence, as judged by the dnadiff tool in the MUMmer [22] software package.
Given that about 8% of our genomic sequence reads actually originated from prokaryotes rather than from the plant, we checked our de novo assembly for prokaryotic sequences by performing Basic Local Alignment Search Tool nucleotide (BLASTN) searches against the 2735 available complete prokaryotic genomes [13].A total of 81,795 bp (0.018%) of the enset de novo assembly matched prokaryotic genome sequences.These sequences were removed from the data submitted to GenBank (accession AMZH01).
We performed a preliminary annotation of the enset genome assemblies using FGENESH [23] to predict protein-coding genes; summary statistics are given in Table 2 and the protein sequences, their genomic coordinates, results of BLASTP searches against the M. acuminata proteome, and the results of functional prediction using PfamScan [24] are available via Figshare [21] (the file was too large to be included in the Supplementary Data).Of 42,749 predicted proteins, 9967 did not have any significant sequence similarity to the banana proteome detectable by BLASTP.It should be noted that due to the fragmented nature of the draft de novo assembly, the number of predicted genes is likely to be significantly over-estimated as some gene models are split between multiple contigs.We used RfamScan [25] to identify non-coding RNA genes, including microRNAs, which are listed in Table 3, and we used RepeatMasker [26] to search for matches to repeat sequences (Table 4), as described in the Experimental Section.Overall, the enset assembly was predicted to have a greater repeat-content (32.65%) than the banana A genome (20.31%).
Gene order was highly conserved between banana and enset, at least over the scale of tens of kilobases, as exemplified in Figure 4, which shows an alignment of the longest enset super-contig against banana chromosome 5.However, we did identify some differences in gene-content between the two genomes as described in the following sections.Table 4. Overview and classification of the repeats present in the enset genome and comparison with those in the M. acuminata genome.

Enset-Specific Genes Include Reverse Transcriptases, Viral Sequences, and a Putative Disease-Resistance Gene
Among the enset genes not conserved in the M. acuminata genome [6], are several predicted to encode reverse transcriptases (Pfam accession PF00078).Reverse transcriptases are characteristic of several classes of mobile elements, including retroviruses, such as the banana streak virus.The phylogenetic relationships of these reverse transcriptases are shown in Figure 5, which indicates that they fall into two distinct clades.One of these clades (in the lower part of Figure 5) includes two genes from banana along with two from enset.However, the other clade (the upper part of Figure 5) includes no known sequences from Musa species, but includes sequences from several other monocot and dicot plants.
Similarly, the enset genome encodes at least 14 predicted proteins containing the integrase core domain (Pfam: PF00665) while the banana genome [6] encodes only one (see Figure 6).The integrase core domain is involved in integration of a copy of a viral genome into the host chromosome.The enset genome also encodes at least 19 predicted retrotransposon gag proteins (Pfam: PF03732) with no closely related sequence in banana (Figure 7).It has been shown that the genomes of some Musa species contain endogeneous retroviruses that are integrated into the host chromosome [28].The genome of E. ventricosum contains several sequences that resemble retrovirus sequences and therefore may represent endogeneous integrated viruses.Specifically, a M. balbisiana sequence containing eBSOLV (endogeneous Obino l'Ewai virus) sequence (GenBank: HE983609 [28]) is highly conserved in E. ventricosum, though this sequence is absent from the M. acuminata genome [6].Similarly, E. ventricosum contains sequences with 86% nucleotide identity to a 2.25-kb fragment of banana streak UA virus (GenBank: AEC49874) and 79% identity to a 1.1-kb fragment of the sugarcane bacilliform virus (SCBV) BT20231 (GenBank: FJ439799 [29]).It is not clear whether any of these virus sequences represent viruses that can become infectious as they can in Musa species [28].
Other enset proteins not found in the banana genome include a protein (GenBank: KB218027) that shares 42% amino-acid identity with Arabidopsis thaliana protein At1g53350, annotated as an RPP8-like resistance protein.Examples such as this are candidates for future studies on disease resistance in enset and perhaps even for introgression into banana.

Experimental Section
The E. ventricosum plant was grown from seed purchased from Jungle Seeds (Wallington, UK).We extracted genomic DNA using the DNAEasy Plant Minikit supplied by Qiagen (Manchester, UK).We sequenced genomic DNA using an Illumina HiSeq 2500, according to the manufacturer's instructions.We used a single lane of an eight-lane flowcell and generated 202 million pairs of 100-nucleotide reads with a mean insert-length of approximately 350 nucleotides.
For alignment of sequence reads against reference sequences, we used BWA version 0.7.5a-r405 [14] and visualized BWA alignments using the Integrative Genomics Viewer IGV [30].For de novo assembly we used SOAPdenovo version 1.05 [31].Prior to assembly, we removed all sequence reads that contained "N"s.Calculations of N 50 and NG 50 were based on the definitions of these two statistics stated by Assemblathon [27].
We used BLAST [32] and MUMMER [22] for pairwise alignments of assembled sequences and reference sequences and visualized BLAST alignments using the Artemis Comparison Tool (ACT) [33].We used MEGA5 [22] for phylogenetic analysis.
To identify repeat sequences, we used RepeatMasker version open-4.0.1 [26,34,35] in default mode run with RMBLAST version 2.2.27+ against the customized library of M. acuminata repeats (1903 sequences) from Hřibová and colleagues [36,37].This is the same library of banana-specific repeats used in the M. balbisiana genome re-sequencing project [12].
For ab initio gene prediction from our de novo genome assembly, we used FGENESH v.3.1.1 [22] with parameters tuned for 'monocot plant'.

Conclusions
Here we present the first genome-wide sequencing study of enset (Ensete ventricosum).We have identified more than 1000 candidate SNPs, and by using less stringent criteria, many more candidates could be identified.These data will be useful as a reference sequence for future "omics on this neglected crop.Armed with this initial draft genome sequence, we can now extend our studies to genotypic variation among different Ethiopian varieties of enset, both cultivated and wild.

Figure 1 .
Figure 1.Frequency distribution for breadth of coverage on 36,542 banana gene sequences by enset whole-genome shotgun sequence reads aligned against the banana genome using BWA.

Figure 2 .
Figure 2. Positions on the banana genome that display heterozygosity in enset.The horizontal axis indicates position on the chromosome and the vertical axis indicates the frequency of the most common base (A, C, G, or T).Only those sites are shown at which there is at least 10× coverage and at which the frequency of the most abundant base is between 49% and 51% inclusive.

Figure 3 .
Figure3.Example of a protein-coding gene that is heterozygous in enset.We aligned enset genomic sequence reads against the banana genome using BWA.The figure shows a 40-nucleotide region of the alignment falling within a protein-coding gene (GSMUA_Achr1T20250_001), encoding a predicted acyl-transferase.This region includes three single-nucleotide polymorphisms, at which the enset genome sequence is heterozygous with approximately 50:50 frequencies for two haplotypes (C…C…T and G…T…C).

Figure 5 .
Figure 5. Maximum-Likelihood phylogenetic tree for enset reverse transcriptase-domain proteins.Protein sequences from E. ventricosum are indicated by circles.The sequences from M. acuminata are indicated by diamonds.Bootstrap values of greater than 50% are indicated as numbers on the branches.

Figure 6 .
Figure 6.Maximum-Likelihood phylogenetic tree for enset integrase core-domain proteins.Protein sequences from E. ventricosum are indicated by circles.Bootstrap values of greater than 50% are indicated as numbers on the branches.

Figure 7 .
Figure 7. Maximum-Likelihood phylogenetic tree for enset integrase core-domain proteins.Proteins sequences from E. ventricosum are indicated by circles.The sequence from M. acuminata is indicated by a diamond.Bootstrap values of greater than 50% are indicated as numbers on the branches.

Table 1 .
Depths of coverage of previously published enset nuclear DNA sequences.The median depth of coverage is 67.67 times.
[27] 50 lengths[27]were calculated on the basis of an estimated genome length of 50 Mb.The total length of the scaffolds submitted to GenBank (under accession AMZH00000000.1)was less than 50% of this estimated length (7.54 Mb versus 25 Mb); therefore, it is not possible to calculate NG 50 length for this dataset.

Table 3 .
Predicted non-coding RNAs in the enset genome assembly predicted by Rfam version 11.