Faba bean (Vicia faba L.) is a cool-season legume species that produces protein-rich grain, not only for human consumption (particularly in Western Asia and Northern Africa) but also for livestock feed in developed regions such as Europe and Australia [1]. Global cultivation on 2.4 Mha in 2012 produced circa 4 Mt [3], the primary producing countries being China, Ethiopia, Morocco, and Australia.
Productivity of faba bean is limited by a number of biotic stresses, including diseases caused by viral, bacterial, and fungal pathogens, and invertebrate pests such as nematodes. The major fungal diseases are chocolate spot (caused by Botrytis fabae Sard.), rust (caused by Uromyces viciae-fabae [Pers.] J. Schröt.), ascochyta blight (caused by Ascochyta fabae Speg.), and downy mildew (caused by Peronospora viciae [Berk.] Caspary) [1]. Constraints due to environmental stresses, such as drought [6], salinity [7], and cold and frost [8], are also significant. Quality characteristics, such as protein content [9] and absence of anti-nutritional factors like tannins [10], are also important targets for breeding improvement.
The genus Vicia belongs to the Vicieae tribe within the Galegoid (cool-season) clade of the sub-family Papilionoideae, which is, in turn, part of the legume family, Fabaceae [11]. The genus contains more than 160 species, with considerable variation in haploid genome size [12], corresponding to a ca. 7.5-fold range, from ca. 1862 to 14,112 Mbp. The genome size of faba bean is close to the upper boundary, at ca. 13,000 Mbp [14]. Differential genome expansions within the Vicia genus are apparently largely attributable to amplification of large retroelement sequences [15], although the basic genetic complement is likely to be conserved.
Faba bean is a facultative outbreeding species, with levels of natural cross-pollination varying from 10% to 70% [16]. The fundamental chromosome number is 6, providing a diploid constitution of 2n = 12. Development of genetic linkage maps for faba bean has been relatively slow compared to other pulse species, and was initially dominated by construction of low-density genetic maps populated by first-generation molecular marker systems, such as isoenzymes, restriction fragment length polymorphisms (RFLPs), and randomly-amplified polymorphic DNAs (RAPDs) [17]. Later studies based on sequence-characterised marker systems, such as intron-targeted amplified polymorphisms (ITAPs), simple sequence repeats (SSRs), and sequence-characterised amplified regions (SCARs), permitted the development of more substantial maps, including the basis for comparative analysis with the genomes of other legume species [14]. Development of large-scale sequence resources, such as expressed sequence tags (ESTs) [21] or genome survey sequences (GSSs) [24], allowed further expansion of marker development, particularly for SSRs and single-nucleotide polymorphisms (SNPs), with associated improvement of genetic map resolution [25]. Various population-specific genetic maps have been used for detection of quantitative trait loci (QTLs) for agronomic performance, environmental stress tolerance, and disease resistance characters [28].
Although current resources are sufficient to support simple strategies for genomics-assisted breeding in faba bean, such as marker-assisted back-crossing [1], a transition to genomic selection [37], which depends on high-density genome-wide sequence polymorphism information, will require a significant expansion of existing sequence collections. Ideally, a reference whole-genome sequence would be used in conjunction with resequencing of selected individuals from training populations. However, for large higher plant genomes, such as that of faba bean, whole-genome sequence assembly remains a technically challenging proposition because of the abundance of repetitive DNA sequences. In contrast, sequencing of the transcriptome, which corresponds to the expressed proportion of the genome, provides an attractive option, especially through use of RNA-Seq technology on second-generation DNA sequencing platforms [38].
A number of transcriptome sampling studies have previously been performed for faba bean, with an emphasis on specific developmental stages or environmental conditions, or using a single type of source tissue [29]. However, there is an incentive to generate a more comprehensive resource through sampling of RNA populations from multiple tissue sources, allowing the construction of a transcriptome atlas, as previously described for grain legume species such as soybean and field pea [44]. Apart from large-scale development of gene-associated molecular genetic markers for the purposes of genomic selection, transcriptome atlases can support gene isolation, identification of differentially-regulated gene sets, and the measurement of gene expression, as well as studies of comparative genomics and (ultimately) annotation of whole-genome sequences [38].
The current study reports on comprehensive transcriptome assemblies using RNA-Seq from two Australian faba bean cultivars (Doza and Farah) that differ in their breeding habit, adaptation characteristics, and disease resistance. Doza is resistant to rust infection and tolerant to frost events, and is best adapted to regions of New South Wales and Southern Queensland that experience warmer spring temperatures [47]. Because Doza is susceptible to ascochyta blight, and therefore yields significantly less, cultivar Farah is favoured in the region of South Australia, to which it is well adapted [48]. Farah is an older cultivar (registered in 2003) [48] than Doza (registered in 2008) [49]. The respective transcriptome assemblies have been compared with the gene complements of several related species. Unigenes have been annotated, and patterns of tissue-specific expression have been characterised. The faba bean transcriptome dataset generated in this study will provide an important resource for future genomics-assisted breeding activities in this species.
4. Materials and Methods
4.1. Plant Material
Three plants from each of the Doza and Farah cultivars were used in this study. The plants were germinated and maintained in standard potting mix in 200 mm plastic pots at 22 ± 2 °C with a photoperiod of 16/8-h (light/dark) within the glasshouse at AgriBio, Bundoora, Victoria, Australia. To prevent any problems with cross-pollination, the plants were isolated through the use of net enclosures during periods of flowering. Multiple tissue sources of plant material from both cultivars were harvested at various time points, in three replicates. Stem and leaf tissues from multiple nodes, along with roots, were sampled from four-week-old plants. Immature pods and fully-open flowers were collected within 8–12 days after flowering. Pods and immature seeds were sampled within 18–23 days post-flowering. The harvested tissue was snap-frozen in liquid nitrogen and stored at −80 °C until RNA extraction was performed.
4.2. RNA Extraction
The three replicates from each of the different source tissues were combined in equimolar quantities prior to grinding of tissue for the RNA isolation step to minimise variability across biological replicates. Total RNA was extracted using the RNeasy® Plant Mini Kit (Qiagen, Hilden, Germany) and treated with DNase I (Qiagen) following the manufacturer’s protocol. The isolated total RNA samples were quantified using a spectrophotometer (Thermo-Scientific, Wilmington, DE, USA), and purity was assessed from the A260/280 and A260/230 absorbance ratios. The extracted samples were resolved on a 1.2% (w/v) denaturing agarose gel to assess RNA integrity.
4.3. Library Preparation and Sequencing
RNA-Seq libraries were prepared with the SureSelect Strand-Specific RNA Library Kit according to the protocol described by the manufacturer, with the exception of the poly(A) RNA fragmentation time. The purified poly(A) RNA was fragmented to an approximate insert size of 350 bp at 94 °C for 1 min, instead of the 8 min recommended in the protocol. The libraries were assessed on an Agilent TapeStation 2200 platform with D1000 ScreenTape (Agilent Technologies, Santa Clara, CA, USA) following the manufacturer’s protocol. Each library was prepared with a unique indexing primer, and all libraries were multiplexed at equimolar concentration to generate a single pool. The multiplexed pooled sample was quantified using a KAPA library quantification kit (KAPA Biosystems, Boston, MA, USA) according to the protocol described by the manufacturer. The quantified sample was subjected to paired-end sequencing using the HiSeq 2000 system (Illumina Inc., San Diego, CA, USA).
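As an illustration of the equimolar pooling step, the following minimal Python sketch converts a library concentration to molarity using the standard approximation of 660 g/mol per base pair of double-stranded DNA, and then computes the volume of each library that contributes the same molar amount. The function names, library names, concentrations, and fragment sizes are hypothetical and are not taken from the study.

```python
def molarity_nM(conc_ng_per_ul, mean_fragment_bp):
    """Convert a dsDNA concentration (ng/uL) to nM, assuming 660 g/mol per bp."""
    return conc_ng_per_ul * 1e6 / (660.0 * mean_fragment_bp)


def equimolar_volumes(libraries, fmol_per_library):
    """Return the volume (uL) of each library that contains fmol_per_library.

    libraries: dict mapping library name -> (ng/uL, mean fragment size in bp).
    1 nM is the same as 1 fmol/uL, so volume = fmol / nM.
    """
    volumes = {}
    for name, (conc, size_bp) in libraries.items():
        volumes[name] = fmol_per_library / molarity_nM(conc, size_bp)
    return volumes


# Hypothetical example values (not from the study).
example_libraries = {"leaf": (6.6, 500), "root": (3.3, 500)}
example_volumes = equimolar_volumes(example_libraries, fmol_per_library=20)
```

In this sketch, a library at half the concentration of another with the same fragment size requires twice the volume to contribute the same number of molecules to the pool.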
4.4. Sequence Data Processing/Data Filtering and De Novo Assembly
The raw sequence reads were filtered using a custom Perl script and Cutadapt v. 1.9 [57]. Adaptor sequences and low-quality reads (reads in which >10% of bases had Q ≤ 20) were removed. Trimming of the data involved removal of reads that contained three or more consecutive unassigned bases (Ns) with a Phred score of ≤20. Sequence reads shorter than 50 bp were discarded prior to the de novo transcriptome assembly step. The filtered data were assembled using the transcriptome assembler SOAPdenovo-TRANS [58] with a k-mer size of 101. To generate longer, more complete sequences, fork, bubble, and complex loci from the SOAPdenovo-TRANS assembly were further combined using the CAP3 assembler [59] with 95% identity and a minimum overlap of 50 bp. Furthermore, contigs and scaffolds with a total length of less than 240 bp were omitted, as these were considered shorter than the length of a single read pair.
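The filtering thresholds described above can be expressed as a short Python sketch. This is not the custom Perl script used in the study, which is not reproduced in the text; the function names are hypothetical, Phred+33 quality encoding is assumed, and the rule for runs of Ns is simplified to a test on the base calls alone.

```python
def phred33(quality_string):
    """Decode a FASTQ quality string, assuming Phred+33 encoding."""
    return [ord(ch) - 33 for ch in quality_string]


def low_quality_fraction(quals, q_cutoff=20):
    """Fraction of bases with quality at or below the cutoff."""
    return sum(q <= q_cutoff for q in quals) / len(quals)


def keep_read(seq, quals, min_len=50, max_low_frac=0.10, q_cutoff=20, n_run=3):
    """Apply the read-level thresholds stated in the text to a single read."""
    if len(seq) < min_len:                       # reads shorter than 50 bp
        return False
    if "N" * n_run in seq.upper():               # three or more consecutive Ns
        return False
    if low_quality_fraction(quals, q_cutoff) > max_low_frac:  # >10% bases Q <= 20
        return False
    return True


def keep_contig(seq, min_len=240):
    """Contigs and scaffolds shorter than 240 bp are omitted after assembly."""
    return len(seq) >= min_len
```

A read passes only if it meets all three read-level criteria; the contig-level length filter is applied separately, after assembly.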
4.5. Transcriptome Annotation
The Doza- and Farah-derived assemblies were analysed using BLASTN [60] and BLASTX [61] against the nucleotide (Nt) and protein (Nr) databases maintained by NCBI, with an E-value threshold of <10⁻¹⁰. Both assemblies were also searched against UniRef100 [62] using the same threshold. The assemblies were further compared with related legume species by nucleotide searches against the coding DNA sequences (CDSs) and genome of Medicago truncatula Gaertn. (M. truncatula Genome Project v. 4.0 [63]), the chickpea (Cicer arietinum L.) genome [52], and soybean (Glycine max L.) CDSs [51].
For further analysis, transcripts that displayed a significant match to non-plant databases based on their annotation were removed from both assemblies. The transcripts were also compared by BLASTN against the previously generated faba bean transcriptome databases of Kaur et al. [22] and Webb et al. [29]. The unannotated transcripts from both assemblies were searched for the presence of open reading frames (ORFs) using the ‘getorf’ command in the EMBOSS package [64]. The transcripts that returned no match in the ORF search were analysed for the presence of annotated contigs in the alternate cultivar. The assembled faba bean transcripts were characterised on the basis of Gene Ontology (GO) using the Blast2GO PRO software program [65] with an E-value threshold of <10⁻¹⁰.
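The E-value threshold used for annotation can be illustrated with a minimal Python sketch that parses BLAST tabular output and keeps, for each query transcript, the hit with the lowest E-value below the cutoff. It assumes the standard 12-column tabular layout (query ID in column 1, subject ID in column 2, E-value in column 11); the study does not state which output format was used, and the function name and example identifiers are hypothetical.

```python
import csv
import io


def best_hits(tabular_text, max_evalue=1e-10):
    """Return {query_id: (subject_id, evalue)} for hits with evalue < max_evalue.

    Assumes the standard 12-column BLAST tabular layout; lines starting
    with '#' (comment lines) and blank lines are skipped.
    """
    best = {}
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        query, subject, evalue = row[0], row[1], float(row[10])
        if evalue >= max_evalue:
            continue  # not significant at the chosen threshold
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return best


# Hypothetical example rows (identifiers are illustrative only).
example_rows = [
    ["contig1", "subjA", "98.5", "200", "3", "0", "1", "200", "1", "200", "1e-50", "370"],
    ["contig1", "subjB", "90.0", "180", "18", "0", "1", "180", "1", "180", "1e-20", "200"],
    ["contig2", "subjC", "80.0", "60", "12", "0", "1", "60", "1", "60", "1e-5", "50"],
]
example_text = "\n".join("\t".join(row) for row in example_rows)
example_best = best_hits(example_text)
```

Transcripts with no entry in the returned dictionary correspond to the unannotated set, which in the workflow described above would be passed on to the ORF search.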
4.6. Tissue-Specific Expression Analysis
The BWA-MEM software package [66] was employed to generate tissue-specific expression profiles by aligning the reads obtained from each of the individual RNA-Seq libraries of Doza and Farah to their respective assembled transcriptomes using the default parameters. The read counts were normalised as originally described by Sudheesh et al. [45] to generate the source tissue-specific expression profiles. The normalised data from each source tissue were classified into three major groups, namely reproductive (flower, pod, immature pod, and seed), vegetative (leaf and stem), and subterranean (root).
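The grouping of tissues can be sketched in Python as follows. The normalisation procedure of Sudheesh et al. [45] is not described in this section, so the sketch substitutes a simple counts-per-million scaling purely as a placeholder assumption; the function names and example counts are hypothetical, whereas the tissue groups follow the classification given above.

```python
# Tissue groups as defined in the text.
TISSUE_GROUPS = {
    "reproductive": ["flower", "pod", "immature pod", "seed"],
    "vegetative": ["leaf", "stem"],
    "subterranean": ["root"],
}


def normalise(libraries):
    """Scale raw counts to counts per million within each library.

    libraries: {tissue: {transcript_id: raw_read_count}}.
    NOTE: counts per million is an assumed stand-in, not the cited method.
    Libraries with no mapped reads are skipped.
    """
    cpm = {}
    for tissue, counts in libraries.items():
        total = sum(counts.values())
        if total == 0:
            continue
        cpm[tissue] = {tx: n * 1e6 / total for tx, n in counts.items()}
    return cpm


def group_means(cpm, transcript):
    """Mean normalised expression of one transcript within each tissue group."""
    means = {}
    for group, tissues in TISSUE_GROUPS.items():
        values = [cpm[t].get(transcript, 0.0) for t in tissues if t in cpm]
        means[group] = sum(values) / len(values) if values else 0.0
    return means


# Hypothetical counts for two libraries.
example_counts = {"leaf": {"tx1": 10, "tx2": 90}, "root": {"tx1": 50, "tx2": 50}}
example_means = group_means(normalise(example_counts), "tx1")
```

Scaling by library size before grouping keeps libraries of different sequencing depth comparable, which is the purpose of any normalisation step at this point in the workflow.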
4.7. Validation of Tissue Expression Analysis
A set of twelve unigenes with differing levels of expression was selected at random from the three major groups described above, on the basis of annotation and putative biological function. RNA was extracted from different tissues (leaf, stem, root, pod, and flower) of the faba bean cultivar ‘Nura’ as detailed above. Primers for the selected transcripts, chosen on the basis of their annotation against NCBI’s Nr database (Table S10), were designed using BatchPrimer3 [67] with default parameters for a product size of 100 to 120 bp, GC content ranging from 40% to 60%, and an optimum annealing temperature between 55 and 60 °C. The GAPDH gene was used as an internal reference gene. The qRT-PCR, melting curve analysis, and normalisation of the obtained data against the internal control were performed as described by Sudheesh et al. [50]. The correlation between the RNA-Seq and qRT-PCR data was assessed by calculating Pearson’s correlation coefficient in Microsoft Excel.
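For readers working outside a spreadsheet, Pearson's correlation coefficient can be computed with a few lines of Python. This is a minimal sketch of the statistic itself, not a reproduction of the analysis performed in the study; the function name is hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of squares and cross-products of deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one input is constant")
    return sxy / math.sqrt(sxx * syy)
```

Here `x` would hold the RNA-Seq expression values and `y` the corresponding qRT-PCR values for the same transcript and tissue combinations, in the same order.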