Characterisation of Faba Bean ( Vicia faba L.) Transcriptome Using RNA-Seq: Sequencing, De Novo Assembly, Annotation, and Expression Analysis

: RNA sequencing (RNA-Seq) is a deep sequencing method used for transcriptome proﬁling. RNA-Seq assemblies have successfully been used for a broad variety of applications, such as gene characterisation, functional genomic studies, and gene expression analysis, particularly useful in the absence of a well-studied genome reference sequence. This study reports on the development of reference unigene sets from faba bean using RNA-Seq. Two Australian faba bean cultivars (Doza and Farah) that differ in terms of disease resistance, breeding habit, and adaptation characteristics, and have been extensively used in breeding programs, were utilised in this study. The de novo assembly resulted in a total of 58,962 and 53,275 transcripts with approximately 67 Mbp (1588 bp N50) and 61 Mbp (1629 bp N50) for Doza and Farah, respectively. The generated transcripts have been compared to the protein and nucleotide databases of NCBI, as well as to the gene complements of several related legume species such as Medicago truncatula , soybean, and chickpea. Both assemblies were compared to previously-published faba bean transcriptome reference sets for the degree of completeness and utility. Annotation of unigenes has been performed, and patterns of tissue-speciﬁc expression identiﬁed. The gene complement derived from this comprehensive transcriptome analysis shows that faba bean, despite its complex 13 Gbp genome, compares well to other legumes in expressed gene content. This study in faba bean represents the most comprehensive reference transcriptomes from two different Australian cultivars available to date and it provides a valuable resource for future genomics-assisted breeding activities in this species


Introduction
Faba bean (Vicia faba L.) is a cool-season legume species, producing protein-rich grain not only for human production (particularly in Western Asia and Northern Africa), but also for livestock feed in developed regions, such as Europe and Australia [1,2].Global cultivation on 2.4 Mha in 2012 produced circa 4 Mt [3], the primary production countries being China, Ethiopia, Morocco, and Australia.
The genus Vicia belongs to the Viceae tribe within the Galegoid (cool-season) clade of the sub-family Papilionoideae, which is, in turn, part of the legume family, Fabaceae [11].Vicia contains more than 160 species, with considerable variation of haploid genome size [12,13], corresponding to ca. 7.5-fold range from ca. 1862-14,112 Mbp.The genome size of faba bean is close to the upper boundary, at ca. 13,000 Mb [14].Differential genome expansions within the Vicia genus are apparently largely attributable to amplifications of large retroelement sequences [15], although the basic genetic complement is likely to be conserved.
Although current resources are sufficient to support simple strategies for genomics-assisted breeding in faba bean, such as marker-assisted back-crossing [1,36], a transition to genomic selection [37], which depends on high-density genome-wide sequence polymorphism information will require a significant expansion of existing sequence collections.Ideally, a reference whole-genome sequence would be used in conjunction with resequencing of selected individuals from training populations.However, for large higher plant genomes, such as that of faba bean, whole genome sequence assembly remains a technically challenging proposition because of the abundance of repetitive DNA sequences.In contrast, sequencing of the transcriptome, which corresponds to the expressed proportion of the genome, provides an attractive option, especially through use of RNA-Seq technology on second-generation DNA sequencing platforms [38].
A number of transcriptome sampling studies have previously been performed for faba bean, with an emphasis on specific developmental stages or environmental conditions, or by using one type of source tissue [29,[39][40][41][42][43].However, there is an incentive to generate a more comprehensive resource through sampling of RNA populations from multiple tissue sources, allowing the construction of a transcriptome atlas, as previously described for grain legume species, such as soybean and field pea [44,45].Apart from large-scale development of gene-associated molecular genetic markers for the purposes of genomic selection, transcriptome atlases can support gene isolation, identification of differentially-regulated gene sets, and the measurement of gene expression, as well as studies of comparative genomics, and (ultimately) annotation of whole genome sequences [38,46].The current study reports on comprehensive transcriptome assemblies using RNA-Seq from two Australian faba bean cultivars (Doza and Farah) that differ in their breeding habit, adaptation characteristics, and disease resistance.Doza is resistant to rust infection and tolerant to frost events, and is best adapted to regions of New South Wales and Southern Queensland that experience warmer spring temperatures [47].Since Doza is susceptible to ascochyta blight and, therefore, has a significantly lower yield, cultivar Farah is favoured in the region of South Australia, to which it is well-adapted [48].Farah is an older cultivar (registered in 2003) [48] compared to Doza (registered in 2008) [49].Comparisons of the respective transcriptome assemblies have been made to the gene complements of several related species.The annotation of unigenes has been performed, and patterns of tissue-specific expression have been characterised.The faba bean transcriptome dataset generated in this study will provide an important resource for future genomics-assisted breeding activities in this species.

RNA-Seq and De Novo Transcriptome Assembly
A total of seven RNA-Seq libraries based on various plant tissues were generated from both of the faba bean cultivars and sequenced aiming at similar depths.However, sequencing runs from Farah generated ca.25% of more data as compared to Doza.The raw sequence data reads were trimmed to remove the adaptor sequences and subsequently filtered to exclude short read lengths and low quality reads.This resulted in a total of 776,387,394 and 1,037,821,214 paired-end reads from Doza and Farah, respectively.The details of the sequence reads from different source tissues are summarised in Table 1.The raw sequence reads were optimised by comparing different word sizes (k-mer), and a k-mer size of 101 was found to be optimal, based on the outcomes prior to the compilation of the assembly.The statistics of the sequencing data filtering and outputs are summarised in Table 2.The clean reads were assembled using SOAPdenovo-TRANS and were further compiled by CAP3 assembler.This resulted in 60,012 and 59,391 transcripts with N50 (length of the longest contig, such that all contigs of at least that length compose at least 50% of the bases of the assembly) scaffold of 1588 and 1629 bp for Doza and Farah, respectively.The distribution of the assembled contigs and scaffolds was determined (Figure 1).The majority of the transcripts were in the range of 301-400 bp (20.4% for Doza and 23.8% for Farah), followed by those that were above 2000 bp in length (13.9% on average).The longest transcript for Doza (Vf-D-scaffold38544) and Farah (Vf-F-scaffold41695) had the length of 23,428 bp and 20,384 bp, respectively.

Classification and Functional Annotation of the Faba Bean Transcriptome
For the Doza-derived assembly, the BLASTN search to Nt database and the BLASTX search to Nr and UniRef100 databases identified 43,065, 44,501, and 44,523 transcripts, respectively (Table S1).For the Farah-derived assembly, 33,768 transcripts exhibited matches to the Nt database, 38,177 transcripts to Nr database as maintained by NCBI and 38,245 transcripts to UniRef 100 database (Table S2).A total of 47,477 Doza-specific transcripts had matches to either one or more Nt, Nr, and UniRef 100 databases with 39,654 transcripts having matches to all the three databases.For Farah, 36,922 transcripts had matches to either one or more Nt, Nr, and UniRef 100 databases with 28,065 Farah-specific transcripts having matches to each of the three databases.
A total of 376 transcripts from the Doza assembly and 5851 transcripts from the Farah assembly exhibited high-value matches of moderate similarity to non-plant-derived sources.A small proportion of these anomalies was resolved by comparison with BLASTN results to the closely related legume species.Totals of 270 (from Doza) and 5753 (from Farah) non-plant-derived sequences were removed from further analysis.Both the assemblies were also assessed for the presence of repeat elements and resulted in the identification of only circa 0.5% transcripts with annotations of repeat elements including retrotransposons components such as long terminal repeats (LTRs).
The BLASTN and BLASTX analysis to the Nt and Nr databases of NCBI revealed that the faba bean sequence annotations showed highest level matches to sequences from chickpea and M. truncatula.The E-value distribution of the significant matches (with E-value <10 −50 ) from the BLAST results exhibited high levels of similarity (89.8% for Doza and 83.3% for Farah) in the Nt database and approximately 71.8% (for Doza) and 69.8% (for Farah) in both the Nr and UniRef100 databases.
Transcripts from the reference Doza and Farah transcriptome assemblies were BLASTN analysed against the CDSs and genome of M. truncatula, CDSs of soybean and the genome of chickpea (Table S3 for Doza and Table S4 for Farah).The distribution of E-values based on BLASTN results of the comparator reference legume species to the generated faba bean assemblies is summarised in Table 3

Contig length (bp)
Distribution of the assembled transcript length from the Doza-specific and Farah-specific assemblies.

Classification and Functional Annotation of the Faba Bean Transcriptome
For the Doza-derived assembly, the BLASTN search to Nt database and the BLASTX search to Nr and UniRef100 databases identified 43,065, 44,501, and 44,523 transcripts, respectively (Table S1).For the Farah-derived assembly, 33,768 transcripts exhibited matches to the Nt database, 38,177 transcripts to Nr database as maintained by NCBI and 38,245 transcripts to UniRef 100 database (Table S2).A total of 47,477 Doza-specific transcripts had matches to either one or more Nt, Nr, and UniRef 100 databases with 39,654 transcripts having matches to all the three databases.For Farah, 36,922 transcripts had matches to either one or more Nt, Nr, and UniRef 100 databases with 28,065 Farah-specific transcripts having matches to each of the three databases.
A total of 376 transcripts from the Doza assembly and 5851 transcripts from the Farah assembly exhibited high-value matches of moderate similarity to non-plant-derived sources.A small proportion of these anomalies was resolved by comparison with BLASTN results to the closely related legume species.Totals of 270 (from Doza) and 5753 (from Farah) non-plant-derived sequences were removed from further analysis.Both the assemblies were also assessed for the presence of repeat elements and resulted in the identification of only circa 0.5% transcripts with annotations of repeat elements including retrotransposons components such as long terminal repeats (LTRs).
The BLASTN and BLASTX analysis to the Nt and Nr databases of NCBI revealed that the faba bean sequence annotations showed highest level matches to sequences from chickpea and M. truncatula.The E-value distribution of the significant matches (with E-value <10 −50 ) from the BLAST results exhibited high levels of similarity (89.8% for Doza and 83.3% for Farah) in the Nt database and approximately 71.8% (for Doza) and 69.8% (for Farah) in both the Nr and UniRef100 databases.
Transcripts from the reference Doza and Farah transcriptome assemblies were BLASTN analysed against the CDSs and genome of M. truncatula, CDSs of soybean and the genome of chickpea (Table S3 for Doza and Table S4 for Farah).The distribution of E-values based on BLASTN results of the comparator reference legume species to the generated faba bean assemblies is summarised in Table 3.In total, 43,284 (72.5%)Doza-specific transcripts had matches to either one or all the three comparator Agronomy 2017, 7, 53 5 of 15 legume species, with 23,299 transcripts displaying hits to all.In the Farah assembly, 33,075 (61.7%) transcripts displayed matches to either one or all of the M. truncatula, chickpea, and soybean datasets, with 16,727 transcripts exhibiting matches to all.Table 3. E-value distribution of the faba bean transcripts BLAST results to sequences from other closely-related species.

Doza Assembly Farah Assembly
Transcripts with E-Value <10 −50   Transcripts with E-Value <10 −10   Transcripts with E-Value <10 −50   Transcripts with E-Value <10 The results of the comparison of Doza (Table S5) and Farah (Table S6) transcriptome assemblies to the previously published faba bean transcriptome sets (from Webb et al. [29] and Kaur et al. [22]) revealed that the current assemblies captured approximately 96% of transcripts from the Webb et al. [29] datasets and approximately 98% of contigs and 78% of singletons from the Kaur et al. [22] study.A total of 25.7% and 35.4% transcripts were found to be specific to the Doza and Farah transcriptome assemblies, respectively, based on comparison to the previously-published faba bean datasets.Reciprocal reference read mapping (Table S7) of Doza to Farah revealed that 92.2% of Doza-derived transcripts matched Farah-derived sequences.The reciprocal read mapping of Farah to Doza showed that 81.2% of Farah-derived transcripts exhibited matches to Doza-derived sequences.Furthermore, an overall comparison results of Farah versus Doza in addition to the datasets from Kaur et al. [22] and Webb et al. [29] studies are summarised in Figure 2.This identified a total of 8428 Doza-specific transcripts, 9437 Farah-specific transcripts and 9572 transcripts common to both the assemblies that was not previously characterised in Vicia faba L. datasets.
Agronomy 2017, 7, 53 5 of 15 all the three comparator legume species, with 23,299 transcripts displaying hits to all.In the Farah assembly, 33,075 (61.7%) transcripts displayed matches to either one or all of the M. truncatula, chickpea, and soybean datasets, with 16,727 transcripts exhibiting matches to all.The results of the comparison of Doza (Table S5) and Farah (Table S6) transcriptome assemblies to the previously published faba bean transcriptome sets (from Webb et al. [29] and Kaur et al. [22]) revealed that the current assemblies captured approximately 96% of transcripts from the Webb et al. [29] datasets and approximately 98% of contigs and 78% of singletons from the Kaur et al. [22] study.A total of 25.7% and 35.4% transcripts were found to be specific to the Doza and Farah transcriptome assemblies, respectively, based on comparison to the previously-published faba bean datasets.Reciprocal reference read mapping (Table S7) of Doza to Farah revealed that 92.2% of Doza-derived transcripts matched Farah-derived sequences.The reciprocal read mapping of Farah to Doza showed that 81.2% of Farah-derived transcripts exhibited matches to Doza-derived sequences.Furthermore, an overall comparison results of Farah versus Doza in addition to the datasets from Kaur et al. [22] and Webb et al. [29] studies are summarised in Figure 2.This identified a total of 8428 Doza-specific transcripts, 9437 Farah-specific transcripts and 9572 transcripts common to both the assemblies that was not previously characterised in Vicia faba L. datasets.To obtain gene function categories of the transcripts generated from Doza and Farah assemblies, Gene Ontology (GO) terms were assigned based on the sequence similarity to Nr databases.The analysis revealed a total of 30,581 transcripts from Doza and 27,285 transcripts from Farah were assigned at least one GO term.BLAST searches showed the highest similarity to M. truncatula, followed by chickpea (Figure S1).GO assignment to the biological process category was highest (41.9%), followed by cellular function (38.3%) and molecular function (19.9%; Figure S1).Among the biological process sub-categories, metabolic process (28.7%) and cellular process (24.5%) were prominently represented (Figure S1), indicating that tissues used in this study were undergoing extensive metabolic activity.A moderate number of transcripts were also involved in the single-organism process (15.9%), biological regulation (7.7%), regulation of biological process (6.6%), and response to stimulus (5.9%) categories.Under the molecular function category, catalytic activity (51.9%) and binding (48.1%) were the most common (Figure S1).For the cellular component category, the majority of the transcripts were assigned to the membrane (18.5%),cell (20.4%), cell part (20.1%), and organelle (13.8%) categories, while much smaller proportions were assigned to the organelle part and macromolecular complex categories (Figure S1).
A total of 50,506 (84.5%)Doza-specific transcripts and 40,462 (75.4%)Farah-specific transcripts exhibited matches to either one or all of the closely-related legume species.Following this process, totals of 2311 Doza-specific and 2511 Farah-specific uncharacterised transcripts were found to be annotated based on the BLAST results against Nt, Nr, and UniRef100 databases.However, 6925 transcripts (11.6%) from Doza and 10,665 transcripts (19.9%) from Farah were found to be uncharacterised based on these databases.The unannotated subsets from each assembly were searched for the presence of open reading frames (ORFs), which identified 5508 and 8592 transcripts from Doza and Farah, respectively.The reciprocal searches identified additional sets of 637 Doza-specific transcripts and 2436 Farah-specific transcripts.For Doza, the sequence length of the uncharacterised transcripts varied from 276 bp to 1355 bp with an average of 380 bp.For Farah, the sequence length of the uncharacterised transcripts ranged from 248 bp to 1142 bp with an average length of 398 bp.Hence, a total of 780 sequences from Doza and 363 sequences from Farah were removed from the final assemblies.The statistics of the final reference set for Doza and Farah is summarised in Table 4.

Tissue-Specific Expression Analysis
Expression patterns within the faba bean transcript assemblies were analysed by aligning the sequence reads obtained from individual libraries to the cultivar-specific assembled transcriptome followed by normalization to the 75th percentile.It was found that 48,432 Doza-specific and 48,596 Farah-specific transcripts were present in all of the types of source tissue used in this study (Table S8).However, the expression level of these common transcripts varied significantly from one source tissue to another.For example, flower tissues from both the cultivars exhibited higher levels of transcripts corresponding to pollen-specific proteins, and leaf and stem tissue comprised of high levels of transcripts that had annotations to proteins involved in chlorophyll synthesis and photosynthesis.Similarly, in immature seeds and pods, the expression of seed-storage proteins, such as vicilin, convicilin, legumin, and embryonic abundant protein, was enriched.In contrast, only a small proportion of transcripts (352 in Doza and 192 in Farah) were expressed exclusively in one tissue-type group.For the Doza-specific assembly, a large number of transcripts were expressed in stem tissue, while immature seeds exhibited a lower number of expressed transcripts (Figure 3).Conversely, for the Farah-specific assembly, the level of the expressed transcripts was higher in the flower tissue and at the lowest value for leaf tissue.
as vicilin, convicilin, legumin, and embryonic abundant protein, was enriched.In contrast, only a small proportion of transcripts (352 in Doza and 192 in Farah) were expressed exclusively in one tissue-type group.For the Doza-specific assembly, a large number of transcripts were expressed in stem tissue, while immature seeds exhibited a lower number of expressed transcripts (Figure 3).Conversely, for the Farah-specific assembly, the level of the expressed transcripts was higher in the flower tissue and at the lowest value for leaf tissue.Levels of expression were evaluated for 12 randomly-selected transcripts to analyse the level of expression among tissues from the transcriptome assembly by qRT-PCR.The transcripts associated with a range of functions in roots (peroxidase, lipoxygenase), pods and seeds (legumin, convicilin), flowers (pectinesterase, pollen-specific pectin methylesterase and leucine-rich repeat extension) and leaves and stems (photosynthesis-related proteins) were evaluated for the level of expression.The majority of the selected transcripts (10 out of 12) displayed good correlations between expression levels assessed by qRT-PCR and RNA-Seq (average Pearson's correlation coefficient of 0.999; Table S9).The remaining two transcripts showed only slight deviation from perfect concordance, with correlation coefficient values of 0.935 (discordant outcome for expression in pods) and 0.916 (discordant outcome for expression in flowers) (Table S9).

De Novo Transcriptome Assembly
The present study describes the use of RNA-Seq to obtain comprehensive transcriptome assemblies from two Australian cultivars of faba bean (Doza and Farah) that differ in terms of breeding habit, adaptation characteristics, and disease resistance.Well-structured reference transcriptome assemblies for these cultivars were generated and characterised, with the aim of improving the application of marker-assisted selection strategies in breeding programs, in the absence of a complete faba bean whole-genome assembly.
RNA-Seq libraries from both Doza and Farah were prepared using multiple tissues obtained from three biological replicates and pooled before sequencing to ensure that genes expressed even at low levels were represented in the current assemblies.Sequencing from Farah generated 25% more Levels of expression were evaluated for 12 randomly-selected transcripts to analyse the level of expression among tissues from the transcriptome assembly by qRT-PCR.The transcripts associated with a range of functions in roots (peroxidase, lipoxygenase), pods and seeds (legumin, convicilin), flowers (pectinesterase, pollen-specific pectin methylesterase and leucine-rich repeat extension) and leaves and stems (photosynthesis-related proteins) were evaluated for the level of expression.The majority of the selected transcripts (10 out of 12) displayed good correlations between expression levels assessed by qRT-PCR and RNA-Seq (average Pearson's correlation coefficient of 0.999; Table S9).The remaining two transcripts showed only slight deviation from perfect concordance, with correlation coefficient values of 0.935 (discordant outcome for expression in pods) and 0.916 (discordant outcome for expression in flowers) (Table S9).

De Novo Transcriptome Assembly
The present study describes the use of RNA-Seq to obtain comprehensive transcriptome assemblies from two Australian cultivars of faba bean (Doza and Farah) that differ in terms of breeding habit, adaptation characteristics, and disease resistance.Well-structured reference transcriptome assemblies for these cultivars were generated and characterised, with the aim of improving the application of marker-assisted selection strategies in breeding programs, in the absence of a complete faba bean whole-genome assembly.
RNA-Seq libraries from both Doza and Farah were prepared using multiple tissues obtained from three biological replicates and pooled before sequencing to ensure that genes expressed even at low levels were represented in the current assemblies.Sequencing from Farah generated 25% more data as compared to Doza, which may be due to the inconsistency in the quantification of RNA-Seq libraries.The final Doza assembly resulted in 58,962 transcripts with total assembly length of 66,959,534 bp (average length 1135.6 bp and N50 of 1595), whereas the final Farah assembly comprised of 53,275 transcripts with total assembly length of 60,943,125 bp (average length 1143.9 bp and N50 of 1683).Both the Doza and Farah assemblies contained a substantial number of large transcripts with sequence length >500 bp (average 67.3%), which was comparable to the results obtained by previous studies [40,50] with a deep sequencing method for transcriptome generation.These results are highly comparable to those for the transcriptomes of other legume species, such as M. truncatula, which has a total of 66,028,174 bp (M.truncatula Genome Project v. 4.0), and soybean, which has 68,278,578 bp [51], and much higher than that of chickpea, which has 32,973,966 bp [52].The average length statistics of both the assemblies were highly comparable to those for M. truncatula and chickpea, with average lengths of 1060 bp and 1166 bp, respectively.

Annotation of the Transcriptome Assemblies
A total of 83.5% and 71.7% of the transcripts from Doza and Farah, respectively (corresponding to 27,653 and 29,151 unigenes), were annotated based on the results of BLAST against the Nr database of NCBI.Based on the number of unigenes, it can be concluded that the Farah assembly is less fragmented (23.6%) than the Doza assembly (37.9%).The number of unigenes obtained from this study is highly comparable to those from other legume genomes including chickpea (28,269 gene models [52]) and lentil (27,396 unigenes based on the Nr database [50]).Moreover, BLASTX analysis revealed that the largest number of matches were to M. truncatula (around 43% and 34% for Doza and Farah, respectively), followed by chickpea (approximately 22% for Doza and 16% for Farah), and then any other plant species.This result is consistent with known phylogenetic relationships, as faba bean is more closely related to M. truncatula than to chickpea [12].A small fraction of sequences in Doza (270), and a comparatively higher fraction in Farah (5753), displayed similarity matches to non-plant sequences based on BLAST results against NCBI and UniRef100 databases, probably due to the presence of contaminating seed-borne microbes and microbial communities in the rhizosphere, soil, and in planta.
Several transcriptome datasets have been previously generated for faba bean, from the cultivars Windsor [53], Icarus and Ascot [22], CDC Fatima, SSNS-1, and A01155 [39], Fiord [40,54], INRA-29H and Vf136 [42], BPL10 and Albus [29], and Wizard [43].These datasets were generated from a selection of tissues including root, shoots, seedling, embryo, seed coat, leaves, or mixed tissues.The reference transcriptome assemblies from the current study were compared to the datasets obtained from the transcriptome studies of Webb et al. [29] and Kaur et al. [22].The former was chosen as it has been used to generate a genetic linkage map with the densest SNP coverage that is currently available, while the latter used Australian cultivars, and a mixture of tissue types providing a broad survey of gene expression diversity.The transcriptome assembly from the present study was generated using the Illumina HiSeq 2000 second-generation sequencing system, while both of the previously-published faba bean datasets were obtained from GS FLX/454 reads.A high proportion of transcripts (96.5%) from Webb et al. [29] was captured in the current study, and 98% of contigs and 78% singletons from [22] were represented in the current reference transcriptome assembly.The remaining singletons (approximately 22%) that were not captured in the current assemblies could be specific to the Icarus/Ascot cultivars, or sequencing assembly artefacts.The two faba bean cultivars used in this study differ in terms of breeding habit and disease resistance characteristics.Doza differs from Farah in the response to biotic and abiotic stresses with Doza being resistant to rust infection and tolerant to frost events, whereas Farah is resistant to ascochyta blight.In addition to this, Farah is considered as the better yielding variety, compared to Doza.Reciprocal sequence analysis revealed an average of 14.2% transcripts that displayed no significant match to any transcript in the other cultivar which may account for some of the observed characteristics between the two faba bean varieties.
In total, 84.5% of Doza-specific transcripts and 75.4% of Farah-specific transcripts were annotated based on the comparisons to NCBI's Nt and Nr databases, UniRef100, comparator legume species genomes, and previously-published faba bean datasets.GO analysis revealed a total of 51.2% transcripts from Doza and 50.1% from Farah were assigned at least one GO term which is comparable with other similar studies published in the literature [23,50].The unannotated sequences (approximately 20.1%) were evaluated for the presence of ORFs and further for the similarity searches in the alternate cultivar that had annotations and the presence of ORF.The genes that were not annotated were more likely to be cultivar-specific and were included in the final assemblies for the completeness of the study as a previous study by Sudheesh et al. [50] also reported a similar percentage (21%) of unannotated genome components in their transcriptome study.

Analysis and Validation of Tissue-Specific Gene Expression Level
RNA-Seq libraries from multiple tissue sources were sequenced to obtain differential tissue gene expression levels based on the approach previously used in field pea and lentil [45,50].Significant proportions of transcripts (96.3% in Doza and 96.7% in Farah) were expressed in more than one source tissue, but showed varying expression levels.The differential expression level from a selected set of transcripts was used to examine differential tissue expression obtained from the Doza and Farah transcriptome assemblies, and found to show good consistency in terms of the Pearson correlation coefficient.However, only two of twelve selected sequences failed to show concordance, possibly due to primer-design issues in the qRT-PCR test, or the effect of a complementary expression profile from a paralogous gene sequence.

Applications to Genomics-Assisted Breeding
A comparison of the Doza-derived and Farah-derived transcriptomes will permit identification of SNPs that discriminate between these cultivars, and which may be shared with other, particularly closely-related, germplasm.A sufficient number of genome-wide-distributed SNPs will be suitable for use in genome-wide association studies to identify genomic regions controlling key traits, and in genomic selection.SNPs can be formatted for detection at a number of different levels of the multiplex ratio, but the most obvious approach to implementation in a species such as faba bean is genotyping-by-sequencing (GBS) in the absence of a reference faba bean genome [55].A GBS methodology based on sampling of the expressed component of the genome by RNA-Seq has been previously reported [56].The unigene sets described in the present study will be a valuable resource for implementation of such a method, which ideally uses alignment to a high-quality reference assembly.

Plant Material
Three plants from each of the Doza and Farah cultivars were used in this study.The plants were germinated and maintained in standard potting mix in 200 mm plastic pots at 22 ± 2 • C with a photoperiod of 16/8-h (light/dark) within the glasshouse at AgriBio, Bundoora, Victoria, Australia.To prevent any problems with cross-pollination, the plants were isolated through the use of net enclosures during periods of flowering.Multiple tissue sources of plant material from both cultivars were harvested at various time points, in three replicates.Stem and leaf tissues from multiple nodes, along with roots, were sampled from four-week-old plants.Immature pods and fully-open flowers were collected within 8-12 days after flowering.Pods and immature seeds were sampled within 18-23 days post-flowering.The harvested tissue was snap-frozen in liquid nitrogen and stored at −80 • C until RNA extraction was performed.

RNA Extraction
The three replicates from each of the different source tissues were combined in equimolar quantities prior to grinding of tissue for the RNA isolation step to minimise variability across biological replicates.Total RNA was extracted and treated with DNase I (Qiagen, Hilden, Germany) using the RNeasy ® Plant Mini Kit (Qiagen) following the manufacturer's protocol.The isolated total RNA samples were quantified using a spectrophotometer (Thermo-Scientific, Wilmington, DE, USA) at the wavelength ratios of A260/230 and A260/230.The extracted samples were resolved on 1.2% (w/v) denaturing agarose gel to assess the integrity of RNA.

Library Preparation and Sequencing
RNA-Seq libraries were prepared with the SureSelect Strand-Specific RNA Library Kit according to the protocol described by the manufacturer, with the exception of the poly(A) RNA fragmentation time.The purified poly(A) RNA was fragmented to an approximate insert size of 350 bp at 94 • C for a minute, instead of 8 min as recommended in the protocol.The libraries were assessed on an Agilent TapeStation 2200 platform with D1000 ScreenTape (Agilent Technologies, Santa Clara, CA, USA) following the manufacturer's protocol.Each library was prepared with a unique indexing primer, and all the libraries were multiplexed in an equimolar concentration to generate a single pool.The multiplexed pooled sample was quantified using a KAPA library quantification kit (KAPA Biosystems, Boston, MA, USA) according to the protocol described by the manufacturer.The quantified sample was subjected to pair-end sequencing using the HiSeq 2000 system (Illumina Inc., San Diego, CA, USA).

Sequence Data Processing/Data Filtering and De Novo Assembly
The raw reads of sequences were filtered by employing a custom perl script and Cutadapt v. 1.9 [57].Adaptor sequences and low quality reads (reads with >10% bases with Q ≤ 20) were removed from the resulting data.Trimming of the data involved removal of the reads that had three or more consecutive unassigned Ns with a phred score of ≤20.Sequence reads that were less than 50 bp were discarded prior to the de novo transcriptome assembly step.The filtered data was assembled using the transcriptome assembler, SOAPdenovo-TRANS [58] with k-mer size of 101.To generate more complete sequences with longer length, fork, bubble and complex loci from SOAPdenovo-TRANS assembly were further combined using the CAP3 assembler [59] with 95% identity and minimum overlap of 50 bp.Furthermore, the contigs and scaffolds having a total length of less than 240 bp were omitted, as these were considered shorter than the length of a single pair of the sequence.

Transcriptome Annotation
The Doza-and Farah-derived assemblies were analysed using BLASTN [60] and BLASTX [61] against the nucleotide (Nt) and protein (Nr) database maintained by NCBI with the threshold E-value of <10 −10 .Both the assemblies were also searched against UniRef100 [62] using the same threshold parameter.The assemblies were further compared by performing a nucleotide search against the genomes of related legume species against the coding DNA sequences (CDSs) and the genome of Medicago truncatula Gaertn.(M.truncatula Genome Project v. 4.0 [63]), the chickpea (Cicer arietinum L.) genome [52], and soybean (Glycine max L.) CDSs [51].
For further analysis, those transcripts that displayed a significant match to non-plant databases based on their annotation were removed from both the assemblies.The transcripts were also BLASTN analysed against the previously-generated faba bean transcriptome databases of Kaur et al. [22] and Webb et al. [29].The unannotated transcripts from both assemblies were searched for the presence of open reading frames (ORFs) using the 'getorf' command in the EMBOSS package [64].The transcripts that returned no match as part of the ORF search were analysed for the presence of annotated contigs in the alternate cultivar.The assembled faba bean transcripts were characterised on the basis of Gene Ontology (GO) using the Blast2GO PRO software program [65] with the E-value threshold of <10 −10 .

Tissue-Specific Expression Analysis
The BWA-MEM software package [66] was employed to generate tissue-specific expression profiles by aligning the reads obtained from each of the individual RNA-Seq libraries of Doza and Farah to their respective assembled transcriptome using the default parameters.The read counts were normalised as originally described by Sudheesh et al. [45] to generate the source tissue-specific expression profile.The normalised data from the specific source tissue was classified into the three major groups, namely, reproductive (flower, pod, immature pod, and seed), vegetative (leaf and stem), and subterranean (root).

Validation of Tissue Expression Analysis
A set of twelve unigenes with differences in the level of expression were randomly selected based on their annotation and putative biological function from the three major groups as described above.RNA extractions from different tissues (leaf, stem, root, pod, and flower) of the 'Nura' cultivar of faba bean were performed as detailed above.The primer sequences for the selected transcripts based on their annotation NCBI's Nr database (Table S10) were designed using BatchPrimer3 [67] with default parameters for the product size of 100 to 120 bp, GC content ranging from 40% to 60% and an optimum annealing temperature between 55 and 60 • C. The GADPH gene was used as an internal reference gene.The qRT-PCR, melting curve analysis and normalisation of the obtained data against the internal control was performed as described by Sudheesh et al. [50].The correlation between the RNA-Seq and qRT-PCR data was assessed by calculating the Pearson's correlation coefficient in Microsoft Excel.

Conclusions
In conclusion, the present study contributes to a total of 26,295 transcripts, with 7648 Doza-specific and 9075 Farah-specific transcripts, which have not been characterized previously in Vicia faba.The validation results confirm that the transcriptome assemblies are the most comprehensive to be generated for faba bean, and will be of significant value in genomics-assisted breeding, as well as a support for functional genomics studies.

Supplementary Materials:
The following are available online at www.mdpi.com/2073-4395/7/3/53/s1, Figure S1.Gene Ontology (GO) results of Doza and Farah transcripts using the Blast2GO PRO software program; Table S1.Bioinformatic annotation (BLAST) of Doza reference unigene transcripts against the Nt, Nr, and UniRef100 databases; Table S2.Bioinformatic annotation (BLAST) of Farah reference unigene transcripts against the Nt, Nr, and UniRef100 databases; Table S3.Bioinformatic annotation (BLAST) of the Doza reference unigene transcripts against comparator legume species; Table S4.Bioinformatic annotation (BLAST) of the Farah reference against comparator legume species; Table S5.Bioinformatic comparison (BLAST) of the Doza reference transcripts with the other faba bean datasets [22,29]; Table S6.Bioinformatic comparison (BLAST) of Farah reference transcripts with the other faba bean datasets [22,29]; Table S7.Reciprocal read mapping results of the reference transcriptome assemblies; Table S8.Transcript expression for each tissue-the normalised read count for the individual libraries after alignment to Doza and Farah reference assemblies; Table S9.Expression profiles of selected transcripts obtained from qRT-PCR and RNA-Seq from different tissues and the Pearson correlation between expressions measured by qRT-PCR and RNA-Seq analysis; Table S10.Primers used in qRT-PCR for the transcriptome assembly validation.Sequence data has been deposited at DDBJ/EMBL/GenBank under the BioProject ID PRJNA395480.

Figure 3 .
Figure 3. Differences in the level of gene expression observed from various tissue samples for Doza and Farah reference transcriptome assemblies.

Figure 3 .
Figure 3. Differences in the level of gene expression observed from various tissue samples for Doza and Farah reference transcriptome assemblies.

Table 1 .
Summary details of the reads (paired-end) used for de novo transcriptome assembly.

Table 2 . Statistics of sequencing outputs and assembly. Primary Assembly Statistics-Doza Statistics-Farah SOAPdenovo-Trans
. In total, 43,284 (72.5%)Doza-specific transcripts had matches to either one or

Table 3 .
E-value distribution of the faba bean transcripts BLAST results to sequences from other closely-related species.

Table 4 .
Statistics of the filtered Doza and Farah datasets.