Genome-Wide Identification and Characterization of Long Non-Coding RNAs from Mulberry (Morus notabilis) RNA-seq Data

Numerous sources of evidence suggest that most of the eukaryotic genome is transcribed into protein-coding mRNAs and also into a large number of non-coding RNAs (ncRNAs). Long ncRNAs (lncRNAs), a group consisting of ncRNAs longer than 200 nucleotides, have been found to play critical roles in transcriptional, post-transcriptional, and epigenetic gene regulation across all kingdoms of life. However, lncRNAs and their regulatory roles remain poorly characterized in plants, especially in woody plants. In this paper, we used a computational approach to identify novel lncRNAs from a published RNA-seq data set and analyzed their sequences and expression patterns. In total, 1133 novel lncRNAs were identified in mulberry, and 106 of these lncRNAs displayed a predominant tissue-specific expression in the five major tissues investigated. Additionally, functional predictions revealed that tissue-specific lncRNAs adjacent to protein-coding genes might play important regulatory roles in the development of floral organ and root in mulberry. The pipeline used in this study would be useful for the identification of lncRNAs obtained from other deep sequencing data. Furthermore, the predicted lncRNAs would be beneficial towards an understanding of the variations in gene expression in plants.


Introduction
Mulberry (Morus notabilis) belongs to the genus Morus, which comprises 10-13 species and over 1000 cultivars distributed throughout Asia, Africa, Europe, and North America [1,2], and are well known for their important economic and medicinal values [3]. In China, mulberry leaves have been used to feed silkworms for silk production [4], and its fruit is either eaten fresh or widely used in the production of juice, wine, jam and canned food [5]. In addition, the root, bark, branch, leaf, and fruit of mulberry have been used for protecting liver, improving eyesight, treating fever, facilitating urination, and lowering blood pressure due to their high levels of isoprenylated flavonoids, such as sanggenon-type flavanones, Diels-Alder adducts, and flavones [6][7][8]. Previous studies have suggested that secondary metabolism products and some small molecule modulators might play critical roles

The Pipeline to Identify lncRNAs from RNA-seq Data
A set of Morus notabilis clean RNA-seq data with a length of 90bp and taken from five different tissues was obtained from a published study [28] and downloaded from the NCBI SRA website with the project number SRX504906. The protein-coding genes of RefSeq [29], Ensembl [30], UCSC [31], and Vega [32] were downloaded from the UCSC genome browser and all known noncoding genes from the NONCODE4.0 database [33]. The mulberry reference genome and gene model annotation files were downloaded from the genome website [28], and a pipeline was developed to identify putative lncRNAs ( Figure 1). Mapped reads from TopHat for each tissue were assembled for each sample separately by Cufflinks [35]. The cufflinks employed spliced read information to determine exon connectivity. Specifically, it used a probabilistic model approach to assemble and quantify the expression level of a minimal set of isoforms and provided the maximum level of annotation on the expression data for given loci. Cufflinks version 2.1.1 was run with default parameters (except "min-frags-per-transfrag = 0"). The multiple assembled transcript files for different tissues were then merged together to produce a unique transcriptome set using Cuffmerge.
We then used an analysis process to minimize false positives and maximize the number of lncRNAs from the merged transcripts, which included the following steps: (1) compare the merged transcripts with known protein-coding genes and lncRNAs in the public databases; (2) select transcripts that are longer than 200 nt; and (3) filter the putative lncRNA transcripts by coding potential using CNCI software [36], which can be categorized as noncoding (CNCI is a powerful signature tool that profiles adjoining nucleotide triplets to effectively distinguish protein-coding and non-coding sequences independent of known annotations) [37].

Calculation of lncRNA Conservation
To further demonstrate the reliability of lncRNAs predicted from the RNA-seq data and calculate the conservation of the novel lncRNAs, a set of lncRNAs collected by TAIR [38] and PlncDB [39] was downloaded and then aligned with the sequences of novel mulberry lncRNAs using BLASTN software [40].

Expression Profiles of Tissue Specific lncRNAs and Functional Predictions
To evaluate the tissue specificity of a transcript, we devised an entropy-based method to quantify the similarity between a transcript's expression pattern and another predefined pattern, which represented an extreme case where a transcript was expressed in only one tissue [41]. After obtaining the lncRNA dataset with tissue-specific expression, we further searched the genomic location information from the genome comparison results by running a script with Perl, and retrieved the information of coding genes within the scope of its ±10 Kb.

qRT-PCR Analysis of lncRNAs
Three individuals of mulberry were used as biological replicates. Tissues from bark, root and winter bud were isolated with a sharp chisel and frozen immediately in liquid nitrogen. Total RNA was extracted with Universal Total RNA Kit (BioTeke, Beijing, China). First-strand cDNA synthesis was carried out with approximately 1.0 μg RNA using the Prime Script™ RT Master Mix (Takara, Dalian, China). All primers used in this study are listed in Supplementary Materials Table S1. Real- After filtering out low-quality reads, the spliced read aligner TopHat version 2.0.9 [34] was used to map all clean reads to the mulberry genome. We used two rounds of TopHat mapping to maximize the usage of the splice junction information from all RNA-seq data. In the first round, all reads were mapped with TopHat (parameters: min-anchor = 5, min-isoform-fraction = 0, and other parameters with default values); in the second round of TopHat remapping, all splice junctions produced by the initial mapping were fed into TopHat to map reads (parameters: raw-juncs, no-novel-juncs, and min-anchor = 5, and min-isoform-fraction = 0).
Mapped reads from TopHat for each tissue were assembled for each sample separately by Cufflinks [35]. The cufflinks employed spliced read information to determine exon connectivity. Specifically, it used a probabilistic model approach to assemble and quantify the expression level of a minimal set of isoforms and provided the maximum level of annotation on the expression data for given loci. Cufflinks version 2.1.1 was run with default parameters (except "min-frags-per-transfrag = 0"). The multiple assembled transcript files for different tissues were then merged together to produce a unique transcriptome set using Cuffmerge.
We then used an analysis process to minimize false positives and maximize the number of lncRNAs from the merged transcripts, which included the following steps: (1) compare the merged transcripts with known protein-coding genes and lncRNAs in the public databases; (2) select transcripts that are longer than 200 nt; and (3) filter the putative lncRNA transcripts by coding potential using CNCI software [36], which can be categorized as noncoding (CNCI is a powerful signature tool that profiles adjoining nucleotide triplets to effectively distinguish protein-coding and non-coding sequences independent of known annotations) [37].

Calculation of lncRNA Conservation
To further demonstrate the reliability of lncRNAs predicted from the RNA-seq data and calculate the conservation of the novel lncRNAs, a set of lncRNAs collected by TAIR [38] and PlncDB [39] was downloaded and then aligned with the sequences of novel mulberry lncRNAs using BLASTN software [40].

Expression Profiles of Tissue Specific lncRNAs and Functional Predictions
To evaluate the tissue specificity of a transcript, we devised an entropy-based method to quantify the similarity between a transcript's expression pattern and another predefined pattern, which represented an extreme case where a transcript was expressed in only one tissue [41]. After obtaining the lncRNA dataset with tissue-specific expression, we further searched the genomic location information from the genome comparison results by running a script with Perl, and retrieved the information of coding genes within the scope of its˘10 Kb.

qRT-PCR Analysis of lncRNAs
Three individuals of mulberry were used as biological replicates. Tissues from bark, root and winter bud were isolated with a sharp chisel and frozen immediately in liquid nitrogen. Total RNA was extracted with Universal Total RNA Kit (BioTeke, Beijing, China). First-strand cDNA synthesis was carried out with approximately 1.0 µg RNA using the Prime Script™ RT Master Mix (Takara, Dalian, China). All primers used in this study are listed in Supplementary Materials Table S1. Real-time qRT-PCR was performed in quadruplicate using the SYBR Premix Ex Taq™ II Kit (Takara, Dalian, China) on a Roche light Cycler 480 (Roche Applied Science, Penzberg, Upper Bavaria, Germany) according to the manufacturer's instructions. Sample cycle threshold (Ct) values were determined and standardized relative to the endogenous control genes ACTIN3, and the 2-∆∆CT method was used to calculate the relative changes in gene expression based on the qRT-PCR data [42].

Transcripts Reconstruction and Identification of Novel lncRNAs
The RNA-seq data used in this study were downloaded from the NCBI SRA website. These reads were paired and both lengths were 90 nt. Starting from a total of 1.2 billion reads, we performed short read gapped alignment using TopHat [34] and recovered 1.01 billion (84%) mapped reads (Table 1). We then used Cufflinks [35] to de novo reconstruct one set of transcripts for each tissue based on the read-mapping results. Transcripts reconstructed were separately merged into combined sets of transcripts using the Cuffcompare utility provided by Cufflinks. After filtering for exon number, transcript length, and coverage, we obtained 41,042 reliably expressed transcripts (Table 2). To assess the robustness of these ab initio assemblers, we analyzed their performance on protein-coding genes. The transcripts we reconstructed using Cufflinks covered 70.79% of known mulberry coding genes ( Figure 2). These results strongly supported the fact that these assembly approaches could robustly and reliably reconstruct both coding and noncoding transcripts at a global level.   Based on the robust transcript reconstruction and broad availability of deep sequencing datasets, we used an analysis process to minimize the false positives and maximize the number of lncRNA transcripts, compared the merged transcripts with known protein-coding genes and lncRNAs in the public databases, and classified the combined transcripts into several different subsets. The majority of the transcripts (53.44%) corresponding to the annotated protein-coding genes, while the rest of the transcripts were undefinable (23.64%), and potentially novel (22.92%). The potentially novel transcripts were then filtered for coding potential based on CNCI software [43], resulting in the identification of 1133 reliably expressed lncRNAs with length >200 nt ( Figure 3). To assess the robustness of these ab initio assemblers, we analyzed their performance on proteincoding genes. The transcripts we reconstructed using Cufflinks covered 70.79% of known mulberry coding genes ( Figure 2). These results strongly supported the fact that these assembly approaches could robustly and reliably reconstruct both coding and noncoding transcripts at a global level.
Based on the robust transcript reconstruction and broad availability of deep sequencing datasets, we used an analysis process to minimize the false positives and maximize the number of lncRNA transcripts, compared the merged transcripts with known protein-coding genes and lncRNAs in the public databases, and classified the combined transcripts into several different subsets. The majority of the transcripts (53.44%) corresponding to the annotated protein-coding genes, while the rest of the transcripts were undefinable (23.64%), and potentially novel (22.92%). The potentially novel transcripts were then filtered for coding potential based on CNCI software [43], resulting in the identification of 1133 reliably expressed lncRNAs with length >200 nt (Figure 3). The identified lncRNAs were classified as intergenic, intronic and antisense lncRNAs based on spatial relationships of their gene loci with protein-coding genes ( Figure 4B). The identified lncRNAs were mostly intergenic lncRNAs, with 1092 in total, accounting for 96.4% of the identified lncRNA. There were 38 intronic lncRNAs, accounting for 3.4%, and 3 antisense lncRNAs, accounting for 0.26% ( Figure 4A). The identified lncRNAs were classified as intergenic, intronic and antisense lncRNAs based on spatial relationships of their gene loci with protein-coding genes ( Figure 4B). The identified lncRNAs were mostly intergenic lncRNAs, with 1092 in total, accounting for 96.4% of the identified lncRNA. There were 38 intronic lncRNAs, accounting for 3.4%, and 3 antisense lncRNAs, accounting for 0.26% ( Figure 4A).
The identified lncRNAs were classified as intergenic, intronic and antisense lncRNAs based on spatial relationships of their gene loci with protein-coding genes ( Figure 4B). The identified lncRNAs were mostly intergenic lncRNAs, with 1092 in total, accounting for 96.4% of the identified lncRNA. There were 38 intronic lncRNAs, accounting for 3.4%, and 3 antisense lncRNAs, accounting for 0.26% ( Figure 4A).

Characterization of the Novel lncRNAs
The length distribution results showed that the novel identified 1133 lncRNAs contained 1755 transcripts mainly in the range of 200-1200 bp. The lengths of 25,902 transcripts from known coding genes were greater than that of the lncRNAs, mostly above 800 bp. The distribution results of exon numbers revealed that there were 982 single exons (3.79%) and 24,920 multi-exons (96.21%) in the

Characterization of the Novel lncRNAs
The length distribution results showed that the novel identified 1133 lncRNAs contained 1755 transcripts mainly in the range of 200-1200 bp. The lengths of 25,902 transcripts from known coding genes were greater than that of the lncRNAs, mostly above 800 bp. The distribution results of exon numbers revealed that there were 982 single exons (3.79%) and 24,920 multi-exons (96.21%) in the 25,902 transcripts from known coding genes. There were 75 single exons (4.27%) and 1680 multi-exons (95.73%) in the lncRNA 1755 transcripts, revealing a similar proportion of multi-exons to the known coding genes ( Figure 5).  In combination with all known lncRNAs, we established a comprehensive catalog of 1133 transcribed lncRNA genes. Based on the Fragments Per Kilobase of transcript per Million mapped reads (FPKM) of each transcript, calculated by "Cufflinks" "abundance estimation mode" across the five tissues, we compared the expression differences between lncRNAs and protein-coding genes. The average expression levels of lncRNAs were lower than those for protein coding genes, but In combination with all known lncRNAs, we established a comprehensive catalog of 1133 transcribed lncRNA genes. Based on the Fragments Per Kilobase of transcript per Million mapped reads (FPKM) of each transcript, calculated by "Cufflinks" "abundance estimation mode" across the five tissues, we compared the expression differences between lncRNAs and protein-coding genes. The average expression levels of lncRNAs were lower than those for protein coding genes, but lncRNAs showed a wider range of abundance, with a subset of them equally abundant to mRNAs ( Figure 6). In combination with all known lncRNAs, we established a comprehensive catalog of 1133 transcribed lncRNA genes. Based on the Fragments Per Kilobase of transcript per Million mapped reads (FPKM) of each transcript, calculated by "Cufflinks" "abundance estimation mode" across the five tissues, we compared the expression differences between lncRNAs and protein-coding genes. The average expression levels of lncRNAs were lower than those for protein coding genes, but lncRNAs showed a wider range of abundance, with a subset of them equally abundant to mRNAs ( Figure 6). Through conservation analysis we found that 112 lncRNAs from the 1133 newly identified genes had homologies in the Arabidopsis database, while only 9 lncRNAs had homologies in the poplar database (Supplementary Table S2). The homology comparison results of the novel lncRNAs of Through conservation analysis we found that 112 lncRNAs from the 1133 newly identified genes had homologies in the Arabidopsis database, while only 9 lncRNAs had homologies in the poplar database (Supplementary Table S2). The homology comparison results of the novel lncRNAs of mulberry with the mapped poplar lncRNAs confirmed the high level of homology between two sequences as 41.31% (Figure 7).

Expression Profiles of Tissue Specific lncRNAs and Functional Predictions
To assess the tissue specificity of mulberry lncRNA expression, we calculated the Jensen-Shannon tissue specificity score (JS score) [40] for each transcript using the established procedure. Using a JS score = 0.9 as a cutoff, we demonstrated that only 9.35% of the lncRNAs were tissue-specific ( Figure 8). Thus, some of the lncRNA expressions of mulberry were clearly subject to tissue

Expression Profiles of Tissue Specific lncRNAs and Functional Predictions
To assess the tissue specificity of mulberry lncRNA expression, we calculated the Jensen-Shannon tissue specificity score (JS score) [40] for each transcript using the established procedure. Using a JS score = 0.9 as a cutoff, we demonstrated that only 9.35% of the lncRNAs were tissue-specific ( Figure 8). Thus, some of the lncRNA expressions of mulberry were clearly subject to tissue dependent regulation, either at the level of transcription or degradation. Figure 7. Alignment of the nucleotide sequences of Mn_lnc_0001 and lincRNA1509. Black and white backgrounds indicate conserved and non-conserved residues, respectively.

Expression Profiles of Tissue Specific lncRNAs and Functional Predictions
To assess the tissue specificity of mulberry lncRNA expression, we calculated the Jensen-Shannon tissue specificity score (JS score) [40] for each transcript using the established procedure. Using a JS score = 0.9 as a cutoff, we demonstrated that only 9.35% of the lncRNAs were tissue-specific ( Figure 8). Thus, some of the lncRNA expressions of mulberry were clearly subject to tissue dependent regulation, either at the level of transcription or degradation.  Comparing their genomic locations with those of known mulberry coding genes, we found that among the 1133 lncRNAs, 106 (9.35%) were tissue-specific, including 82 lncRNAs adjacent (˘10 kb) to protein-coding genes (Supplementary Materials Table S3-S7). The functions annotated to the protein-coding genes mainly involved hormone signal recognition and transduction, plant secondary metabolite synthesis, energy metabolism, etc. The lncRNAs Mn_lnc_0132, Mn_lnc_0521, and Mn_lnc_0782 were specifically expressed in male flower: Mn_lnc_0132 was located near the protein-coding gene EXB28594.1 (Protein PROLIFERA). Protein PROLIFERA, a highly conserved protein, was found in all eukaryotes, and specifically expressed in populations of dividing cells in sporophytic tissues of the plant body, such as the palisade layer of the leaf and founder cells of initiating flower primordia [44]. Mn_lnc_0521 was located near EXB81017.1 (Serine/threonine-protein phosphatase PP). The PP1s were shown to play key roles in many aspects of plant growth and development, such as pollination and pollen tube development [45][46][47]. It was found that Mn_lnc_0782 was located near EXC20310.1 (Phosphoenolpyruvate/phosphate translocator PPT). Located in the plastid, PPT played a pivotal role in the regulation of leaf color, florescence, and female and male gametophyte formation [48][49][50][51]. A sulfate transporter, Mn_lnc_0714, was specifically expressed in root and located near EXC06697.1 (Sulfate transporter 1.3). It was tissue-specifically expressed and was crucial for root development and symbiotic nitrogen fixation in root nodules [52][53][54].
To validate RNA-seq results, qRT-PCR were performed for 10 randomly selected tissue-specific lncRNAs in bark. As a result, all 10 reactions generated sequence products. Remarkable higher relative quantitative expressions of the 10 lncRNA were observed in bark. However, only 2 and 4 of the 10 lncRNAs expressed in winter bud and root, respectively, but their expression levels were quite low, ranging from 1.4% to 15.4% of the expression level in bark (Figure 9). development and symbiotic nitrogen fixation in root nodules [52][53][54].
To validate RNA-seq results, qRT-PCR were performed for 10 randomly selected tissue-specific lncRNAs in bark. As a result, all 10 reactions generated sequence products. Remarkable higher relative quantitative expressions of the 10 lncRNA were observed in bark. However, only 2 and 4 of the 10 lncRNAs expressed in winter bud and root, respectively, but their expression levels were quite low, ranging from 1.4% to 15.4% of the expression level in bark (Figure 9).

Discussion
An avalanche of RNA-seq data emerged as powerful high-throughput sequencing technologies became more pervasive and user-friendly. However, systematic identification of lncRNAs was limited to only a few plant species [21,26,27,55,56], leaving most plant transcriptome sequencing data not fully explored, even though these novel molecules play important roles in a wide range of biological processes [15]. Because lncRNAs are generated by the same transcriptional machinery as mRNAs [57], no defining biochemical features could be exclusively ascribed to lncRNAs, such as a 5′ cap, 3′ polyadenylated tail, and splicing [58]. Defining lncRNAs simply on the basis of size and lack of protein-coding capability was intellectually far from satisfying. In this paper, we designed a strict computational pipeline and identified 1133 novel lncRNAs from the entire genome using a set of published mulberry next-generation RNA-seq data. The pipeline used in this study can be easily adapted to other organisms, especially for species that have not been well studied to date.

Discussion
An avalanche of RNA-seq data emerged as powerful high-throughput sequencing technologies became more pervasive and user-friendly. However, systematic identification of lncRNAs was limited to only a few plant species [21,26,27,55,56], leaving most plant transcriptome sequencing data not fully explored, even though these novel molecules play important roles in a wide range of biological processes [15]. Because lncRNAs are generated by the same transcriptional machinery as mRNAs [57], no defining biochemical features could be exclusively ascribed to lncRNAs, such as a 5 1 cap, 3 1 polyadenylated tail, and splicing [58]. Defining lncRNAs simply on the basis of size and lack of protein-coding capability was intellectually far from satisfying. In this paper, we designed a strict computational pipeline and identified 1133 novel lncRNAs from the entire genome using a set of published mulberry next-generation RNA-seq data. The pipeline used in this study can be easily adapted to other organisms, especially for species that have not been well studied to date.
The expression levels of the novel mulberry lncRNAs in root, leaf, bark, bud, and male flower were below the expression levels of mRNAs, which was consistent with findings in other species [59][60][61]. Conservation analysis found that among the 1133 lncRNAs, 112 (9.4%) had homology in the Arabidopsis database, and 9 (0.8%) had homology in the poplar lncRNA database. The low levels of conservation might be caused by the incomplete lncRNA databases of plants. The results also reflected the less restrictive factors on the evolution of lncRNAs, and thus the low conservation levels of lncRNA sequences among species, factors that reduce the possibility of forming a large family with homologous genes. Moreover, qRT-PCR was performed, and the RNA-seq results were consistent with the qRT-PCR data, providing further proof that the prediction accuracy was sufficient.
Numerous studies have shown that lncRNAs with tissue-specific expression usually had special functions [62], and the lncRNAs of higher species primarily played the biological role of cis-regulation of the adjacent genes [63][64][65]. In the analysis of tissue-specific expression, we found that 106 lncRNAs from our 1133 newly identified genes were expressed specifically in five separate tissues, among which 82 had known protein-coding genes in the range of˘10 Kb. We therefore predicted the functions of these lncRNAs by analysis of the tissue-specific expressions and the functions of adjacent coding genes. Further analysis showed that three male flower-specific lncRNAs were located adjacent to coding genes, which are related to development of floral organs. One root-specific lncRNA was located adjacent to a coding gene, which is crucial for root development and symbiotic nitrogen fixation. These results suggest that these novel lncRNAs might play important regulatory roles in the development of floral organs and root in mulberry.
Regarding the important functions of lncRNAs in plant growth and development, their identification within plant-wide genomes is rapidly developing. By contrast, the functional characterization of lncRNAs for plants is far behind that of other species. So far, the commonly used methods for lncRNA functional prediction are based on co-expression networks [57], miRNA regulation [66], protein binding [67], epigenetic modification [68], and adjacent gene functions. In this study, due to the influence of sequencing data (insufficient sample size) we cannot make functional predictions through the methods of co-expression networks and others. These methods are only based on the functional predictions of bioinformatics, so the accurate assignment of functions of lncRNAs still requires verification through biological experiments. However, with the development of biotechnology and more information becoming known about lncRNAs, their important functions in plant growth and development will be uncovered gradually.