Transcriptome Assembly of the Bast Fiber Crop, Ramie, Boehmeria nivea (L.) Gaud. (Urticaceae)

Ramie (Boehmeria nivea) is a perennial crop valued for its strong bast fibers. Unlike other major bast fiber crops, ramie fiber processing does not include retting, but does require degumming, suggesting distinctive features in pectin and in the development and composition of fibers. A comprehensive transcriptome assembly of ramie has not been made available to date. We sequenced RNA transcripts (RNA-Seq) from the apical region of developing ramie stems and combined these with reads from public databases for a total of 157,621,051 paired-end reads (30.3 giga base pairs, Gbp) used as input for de novo assembly, resulting in 70,721 scaffolds (≥200 base pairs (bp); N50 = 1798 bp). As evidence of the quality of the assembly, 36,535 scaffolds aligned to at least one Arabidopsis protein (BLASTX e-value ≤ 10^−10). The resource described here for B. nivea will facilitate an improved understanding of bast fiber, cell wall, and middle lamella development in this species and in comparative species.


Introduction
Ramie (Boehmeria nivea (L.) Gaud.) is a perennial, herbaceous, semi-tropical species in the Urticaceae family. It has been grown for at least 6000 years in Asia. Approximately 70,000 ha are now cultivated worldwide (primarily in China), with 124,000 tonnes harvested in 2013 (http://faostat3.fao.org/) [1]. Ramie is valued primarily for its long, strong phloem (i.e., bast) fibers, which are used in textiles, most often as a blend with other types of fibers. There is also growing interest in other uses of the fibers, such as in composite materials [2]. Ramie fibers have similar properties to other commercial bast fibers [3,4]. While linen and hemp harvesting rely on the microbial retting (degradation) of pectins, ramie fiber is normally extracted by decorticating fresh stems to produce bark ribbons that are subsequently dried, scraped, and degummed. Extracts of various parts of the ramie plant have also been reported to have beneficial medicinal properties, including antiviral and anti-inflammatory activities [5][6][7].
Research on ramie has included descriptions of specific genes, including transcription factors, expansins, and cellulose synthases, as well as transcriptome-level responses related to water deficit, heavy metals, nematodes, hormones, and developmental processes [8][9][10][11][12][13][14]. However, no assembled transcriptome of ramie has been reported to date. We present here an assembly of all available transcript sequence reads of the B. nivea cultivar Zhongzhu 1, including novel RNA sequencing reads generated in our laboratory from the apex of developing shoots.

Materials and Methods
Seeds of B. nivea cv. Zhongzhu 1 were obtained from the Institute of Bast Fiber Crops (Changsha, China) and grown under natural light in a glasshouse at the University of Alberta (Edmonton, AB, Canada). When plants were approximately 1 m high, the top 20 cm of the stems of three plants were harvested. Leaves larger than 10 mm (including petioles) were removed, the stem segments were homogenized in liquid nitrogen, and RNA was extracted using a modified CTAB extraction combined with an RNeasy Kit (Qiagen, Valencia, CA, USA) [15]. A total of 10 µg of RNA was sent to the Beijing Genomics Institute (BGI; BGI Inc., Shenzhen, China) for library preparation and paired-end transcriptome sequencing, as previously described [16]. Briefly, poly(A)+ mRNA was isolated using oligo(dT)-coupled magnetic beads, and this mRNA was used as a template for random hexamer-primed first-strand cDNA synthesis, followed by second-strand synthesis using E. coli DNA Pol I. Double-stranded cDNA was sheared with a nebulizer, end-repaired, and ligated to Illumina PE adapter oligos. Subsequently, 200 base pair (bp) fragments were selected by gel purification and PCR-amplified for 15 cycles before sequencing on an Illumina Genome Analyzer II (San Diego, CA, USA) with 75 bp paired-end reads. Raw reads from all runs were filtered to remove adapter sequences, contamination, and low-quality reads, and the filtered reads were deposited in the Sequence Read Archive (SRA) as ERR364387.
A k-mer is a substring of length k taken from a sequence read, and k sets the word size used during assembly. Reads were used as input for de novo assembly with ABySS using a k-mer sweep (k = 35, 39, 45, 49, 55, 59, 65, 69), and all of the reads except ERR364387 were additionally assembled with k = 75, 79, 85, 89 [17,18]. The scaffold files output for each value of k were merged with Trans-ABySS, a software package that uses de Bruijn graphs to assemble sequence reads [19]. The resulting merged scaffolds were further assembled using CAP3 [19] (minimum overlap < 100 bp), which joins fragments based on shared sequences. This was followed by redundancy reduction with CD-HIT-EST, which clusters similar sequences (98% identity) [20]. Reads from each of the original libraries were mapped to the assembled scaffolds using Bowtie 2 with default parameters; Bowtie 2 uses an implementation of the Burrows-Wheeler transform to align reads to a reference sequence.
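As a minimal illustration of the k-mer definition (a sketch only, not the assembler's own code), the following Python snippet enumerates the overlapping substrings of length k in a read and counts how many k-mers a read of a given length can supply. The function names and the constant K_SWEEP_ALL are hypothetical and introduced here for illustration; the k values are those listed in the sweep above.

```python
def kmers(read: str, k: int) -> list[str]:
    """Return every overlapping substring of length k in the read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def kmers_per_read(read_length: int, k: int) -> int:
    """Number of k-mers that a read of the given length contributes."""
    return max(read_length - k + 1, 0)


# k values applied to every library in the sweep (hypothetical constant name).
K_SWEEP_ALL = [35, 39, 45, 49, 55, 59, 65, 69]

if __name__ == "__main__":
    print(kmers("ACGTAC", 3))  # toy read, not real data
    for k in K_SWEEP_ALL + [75, 79, 85, 89]:
        print(k, kmers_per_read(75, k))
```

For a 75 bp read, this gives 41 k-mers at k = 35 and 7 at k = 69, but only a single k-mer at k = 75 and none at larger values, which is consistent with the 75 bp library ERR364387 being left out of the larger k values.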

Assembly
RNA was obtained from the apical-most 20 cm of stems of B. nivea. This segment included xylem and phloem fibers in a gradient of developmental stages, ranging from specification to fiber cell wall thickening. The RNA molecules were reverse transcribed and the resulting DNA was sequenced using paired-end technology, which allows sequences to be obtained from both ends of each DNA molecule. In total, 813 Mbp (mega base pairs) of sequence were obtained from 5,571,079 PE (paired-end) reads of 75 bp (base pairs) each, from which low-quality reads and adaptors had been removed, and the reads were submitted to the NCBI Sequence Read Archive (SRA) as accession ERR364387. These reads were used along with other publicly available reads to provide a total of 30.3 Gbp (giga base pairs) of input for the assembly (Table 1).
Using ABySS, all of the reads in Table 1 were used as input for de novo assembly using a k-mer sweep, with the output merged using Trans-ABySS [17], followed by further assembly in CAP3 [19] (minimum overlaps <100 bp). Redundant assembly units were removed using CD-HIT-EST [20], resulting in a final assembly that contained 70,721 scaffolds with a minimum length of 200 bp, a scaffold N50 of 1798 bp, and a maximum scaffold length of 22,363 bp (Table 2, Supplemental File 1).
To evaluate the quality of the assembly, reads from each of the original libraries (Table 1) were mapped to the assembled scaffolds using Bowtie 2 [21], resulting in overall alignment rates ranging between 78% (for ERR364387) and 98% (for SRR546782). The distribution of scaffold lengths in the assembly is shown in Figure 1.
[Table fragment: 77,022,208 (footnote 1)]
1 Only scaffolds with a length of ≥200 bp are counted here.

Analysis of Assembly Quality
To further evaluate the assembly, BLASTX was used to align the 70,721 scaffolds to the complete set of predicted Arabidopsis proteins [23,24]. As a result, 36,535 scaffolds aligned to at least one Arabidopsis protein at a significance threshold of e-value ≤ 10^−10. Conversely, at the same significance threshold, 20,835 out of 27,416 Arabidopsis proteins aligned to at least one scaffold from the B. nivea assembly. The top Arabidopsis match for each of the BLASTX alignments was used with the Gene Ontology (GO) Annotation Search, Functional Categorization, and Download Tool to categorize the scaffolds according to their cellular component, molecular function, and biological process (https://www.arabidopsis.org/tools/bulk/go/index.jsp). The distribution of GO categories assigned to the B. nivea scaffolds was generally similar to the distribution within the Arabidopsis proteins, indicating good coverage of the B. nivea genome by this assembly (Figure 2).
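The two counts reported above (scaffolds with at least one protein hit, and proteins hit by at least one scaffold) can be tallied from alignment output along the following lines. This is a hedged sketch: it assumes the alignments were saved in the standard 12-column tabular BLAST format (query ID in column 1, subject ID in column 2, e-value in column 11), which the text does not state, and the function name and example identifiers are hypothetical.

```python
from typing import Iterable


def count_reciprocal_coverage(
    lines: Iterable[str], max_evalue: float = 1e-10
) -> tuple[int, int]:
    """Count distinct queries and distinct subjects with a hit at or below max_evalue.

    Assumes tab-separated BLAST tabular output with the default 12 columns.
    """
    queries: set[str] = set()
    subjects: set[str] = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        query_id, subject_id, evalue = fields[0], fields[1], float(fields[10])
        if evalue <= max_evalue:
            queries.add(query_id)
            subjects.add(subject_id)
    return len(queries), len(subjects)


# Made-up rows for illustration only; these are not results from the assembly.
example_rows = [
    "scaf1\tprotA\t80.0\t100\t20\t0\t1\t300\t1\t100\t1e-50\t200",
    "scaf1\tprotB\t60.0\t100\t40\t0\t1\t300\t1\t100\t1e-12\t90",
    "scaf2\tprotA\t75.0\t100\t25\t0\t1\t300\t1\t100\t1e-30\t150",
    "scaf3\tprotC\t40.0\t50\t30\t0\t1\t150\t1\t50\t1e-3\t35",
]

if __name__ == "__main__":
    n_scaffolds, n_proteins = count_reciprocal_coverage(example_rows)
    print(n_scaffolds, n_proteins)
```

In the toy input, the last row fails the e-value threshold, so its scaffold and protein are not counted.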

Discussion
In evaluating the quality of this assembly, we considered first the number and length of scaffolds, where each scaffold presumably represents a transcript of a gene (Figure 1, Table 2). The N50 value of this assembly is 1798 bp, meaning that scaffolds of 1798 bp or longer account for at least 50% of the total assembled bases. This is longer than the median 1089 bp gene coding sequence length reported for land plants, suggesting that many of the assembled scaffolds represent full-length transcripts [25]. Furthermore, there were 43,828 transcripts in the assembly that were more than 459 bp long (Table 2), which is within the range of the number of proteins typically found in plant genomes [26]. Finally, we considered the ability to assign sequences to functional categories, and found that these generally matched the distribution of functions in the model plant species A. thaliana (Figure 2), suggesting that the B. nivea assembly presented here accurately reflects the genes encoded in this species.
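For readers unfamiliar with the N50 statistic, the short Python sketch below shows one common way to compute it from a list of scaffold lengths: sort the lengths from longest to shortest and return the length at which the running total first reaches half of the assembly. The function name and the toy lengths are illustrative assumptions and are not taken from the assembly itself.

```python
def n50(lengths: list[int]) -> int:
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L
    together contain at least half of all assembled bases.
    """
    if not lengths:
        raise ValueError("N50 is undefined for an empty set of lengths")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise AssertionError("unreachable: running total always reaches half")


if __name__ == "__main__":
    toy_lengths = [2, 2, 2, 3, 3, 4, 8, 8]  # toy values, not real scaffolds
    print(n50(toy_lengths))
```

With the toy lengths, the total is 32 bases and the two longest sequences (8 + 8) already reach half of that, so the N50 is 8.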

Conclusions
We have described the first public transcriptome assembly of the bast fiber crop B. nivea (ramie) from 30.3 Gbp of paired-end reads, resulting in 70,721 scaffolds (≥200 bp; N50 = 1798 bp), and have demonstrated the quality of this assembly through a comparison of its predicted proteins to the Arabidopsis reference proteome, using both BLASTX alignment and GO categorization. These data will facilitate further research in bast fiber crops, including ramie.

Figure 2. Gene Ontology (GO) categorization of predicted proteins among B. nivea scaffolds. The figure represents the proportion of proteins assigned to each category within the three aspects of (a) molecular function, (b) cellular component, and (c) biological process for both B. nivea and A. thaliana.

Table 1. Libraries used in transcriptome assembly.

Table 2. Transcriptome de novo assembly statistical summary.