Single-Molecule Long-Read Sequencing of Zanthoxylum bungeanum Maxim . Transcriptome : Identification of Aroma-Related Genes

Zanthoxylum bungeanum Maxim. is an economically important tree species that is resistant to drought and infertility, and has potential medicinal and edible value. However, comprehensive genomic data are not yet available for this species, limiting its potential utility for medicinal use, breeding programs, and cultivation. Transcriptome sequencing provides an effective approach to remedying this shortcoming. Herein, single-molecule long-read sequencing and next-generation sequencing approaches were used in parallel to obtain transcript isoform structure and gene functional information in Z. bungeanum. In total, 282,101 reads of inserts (ROIs) were identified, including 134,074 full-length non-chimeric reads, among which 65,711 open reading frames (ORFs), 50,135 simple sequence repeats (SSRs), and 1492 long non-coding RNAs (lncRNAs) were detected. Functional annotation revealed metabolic pathways related to aroma components and color characteristics in Z. bungeanum. Unexpectedly, 30 transcripts were annotated as genes involved in regulating the pathogenesis of breast and colorectal cancers. This work provides a comprehensive transcriptome resource for Z. bungeanum, and lays a foundation for the further investigation and utilization of Zanthoxylum resources.


Introduction
Zanthoxylum bungeanum Maxim. is a member of the evergreen and deciduous trees and shrubs of the citrus (Rutaceae) family that is native to China, where it is mainly used as a condiment and oilseed crop [1].Although the popularity of Z. bungeanum as a condiment is relatively low outside of China, its medicinal value merits attention.This species has been used as an ancient herbal medicine in China, and its pericarp shows anti-inflammatory and anti-bacterial activities [2], relieves arthritis [3], protects against gastric mucosal damage [4], lowers blood lipid levels [5], regulates antithrombotic effects [6], and enhances immunity [7].A recent study found that collagen extracted from Z. schinifolium exerts positive anti-tuberculosis effects [8].More interestingly, genes associated with breast and colorectal cancers were identified during the functional annotation of Z. bungeanum in this work.The discovery of artemisinin is a powerful reminder of the need to devote more attention to seemingly niche plants that possess unique medicinal qualities in order to tap their potential value.
Despite its long history of use, Z. bungeanum is a plant that remains poorly understood.This species grows on sloping land and benefits from strong resistance to stress.However, while physiological studies of this phenomenon have been performed, in-depth molecular knowledge is lacking.In addition, Z. bungeanum is an apomictic plant whose male flowers are not found in cultivars [9,10], yet research on this aspect is also scarce.Currently, available Z. bungeanum resources are mostly cultivars.It remains difficult to find wild species, which impedes the origin and classification analysis.Moreover, Z. bungeanum contains numerous chromosomes and its ploidy is still unclear.Due to its complex genetic information, the genomics research of Z. bungeanum has progressed slowly.The lack of genomic and high-quality transcriptome information acts as a barrier to such studies and hampers a full exploration of the medicinal value of this plant.
The advent of single-molecule real-time (SMRT) sequencing technology has injected new vitality into genome and transcriptome research, which provides useful gene structure and function information for many organisms [11,12].High-quality SMRT transcriptome sequencing is particularly powerful for investigating alternative splicing, alternative polyadenylation, novel genes, and non-coding RNAs [13].Most non-model organisms lack genomic data, and full-length transcripts derived using the PacBio platform can greatly facilitate basic and applied research on gene function, gene expression regulation, and evolutionary relationships [14][15][16].Next-generation sequencing (NGS) technology also has advantages, including high-throughput, improved accuracy, and almost universal application [17][18][19].NGS and SMRT sequencing approaches complement each other, and their combination facilitates more complete transcriptome resources [20].
The present study reports the first full-length transcriptome of Z. bungeanum from five tissues using SMRT sequencing technology.Short reads obtained by NGS were used to correct the transcript isoforms that were obtained using SMRT sequencing.Annotation of the Z. bungeanum transcript isoforms sheds light on the complexity of the molecular mechanisms underpinning the medicinal and edible functions.This study provides genetic resources and transcriptome information to support further research on Z. bungeanum.

Plant Materials and Sample Preparation
The Z. bungeanum cultivar "Fuguhuajiao" was grown in an experimental field at the Research Centre for Engineering and Technology of Zanthoxylum, State Forestry Administration, Northwest A&F University, Fengxian, Shaanxi Province, China.Shoots (sprouting stage), leaves (leaf expansion stage), stems (branching stage), flowers (flowering stage), and fruits (fruiting stage) were collected in approximately equal weights from the east, west, south, and north directions from a 6-year-old plant (Figure 1).Collected samples were quickly frozen in liquid nitrogen and stored at −80 C until used.

RNA Preparation
The total RNA was extracted from each tissue using an RNeasy Plus Mini Kit (Qiagen, Valencia, CA, USA).The RNA quality, integrity, and quantities were determined using a NanoDrop ND-1000 spectrophotometer (NanoDrop Technologies, Wilmington, DE, USA) and an Agilent Bioanalyzer 2100 system (Agilent Technologies, Palo Alto, CA, USA).For all samples, the RNA integrity number was >7.5.Qualified RNA samples were subsequently used for constructing complementary DNA (cDNA) libraries.

PacBio cDNA Library Construction and Sequencing
The cDNA was synthesized using mixed RNA from all five tissues with a SMARTer PCR cDNA Synthesis Kit (Clontech, Palo Alto, CA, USA; Cat.No. 634926).Full-length cDNA fragments were screened using a BluePippin Size Selection System (Sage Science, Beverly, MA, USA), and three cDNA libraries (1−2 kb, 2−3 kb, and >3 kb) were constructed.Selected full-length cDNA was re-amplified by PCR and the SMRT bell hairpin loop was ligated to the cDNA.A Qubit 2.0 fluorimeter (Life Technologies, Carlsbad, CA, USA) was then used for accurate quantification of the libraries after a secondary screening with the BluePippin system.Library size was measured using an Agilent Bioanalyzer 2100 system, and sequencing was performed on a PacBio RSII instrument (Pacific Bioscience, Menlo Park, CA, USA).

Illumina cDNA Library Construction and Sequencing
The mRNA was purified and enriched from mixed total RNA using oligo (dT) magnetic beads.A fragmentation buffer was added to break the mRNA into random fragments using divalent cations under elevated temperatures.The resultant mRNA fragments were used as templates to synthesize the first-strand cDNA using random hexamers.The second cDNA strand was synthesized using buffer, deoxyribonucleotide triphosphates (dNTPs), Ribonuclease H (RNase H), and DNA polymerase I, after which the cDNA was purified using AMPure XP (Beckman, Brea, CA, USA) beads.Double-stranded cDNAs were further end-repaired, poly(A)-tailed, and ligated to sequencing adapters.The purified and repaired double-stranded cDNA fragments were size-selected again with AMPure XP beads and enriched by PCR.Qualifying libraries were sequenced using an Illumina HiSeq instrument (Illumina Inc., San Diego, CA, USA).

Quality Filtering and Error Correction of PacBio Long Reads
Amplified sequences obtained by SMRT sequencing which were smaller than 50 bp and had a quality score of less than 0.75 were removed by filtering.The application of full passes that were ≥0 resulted in the extraction of reads of inserts (ROIs) from filtered subreads.ROIs were separated into full-length and non-full-length reads by detecting whether 5 primers, 3 primers, and polyA tails were included in ROIs.Iterative Clustering for Error Correction (ICE) [21] was then used to obtain consensus isoforms, and high-quality consensus isoforms from ICE were polished using Quiver.Clean reads obtained after the removal of adapters and low-quality reads from raw Illumina NGS data were used to correct low-quality full-length transcript isoforms using Proovread software [22].High-quality and corrected low-quality transcript isoforms were confirmed as de-redundant using CD-HIT software [23] to obtain non-redundant isoforms.

Analysis of Detecting Coding Sequence (CDS), Simple Sequence Repeat (SSR), and Long Non-Coding RNA (lncRNA) Features
Each open reading frame (ORF) detected by the TransDecoder v.3.0.0 (https://github.com/TransDecoder/TransDecoder/releases) was defined as a putative detecting coding sequence (CDS).Predicted CDS types were classified as complete (both start and stop codons were predicted), 5 end (5 -partial, only the start codon was predicted), 3 end (3 -partial, only the stop codon was predicted), or internal segment (neither the start codons nor the stop codons were predicted).

General Properties of Single-Molecule Long-Reads
The total RNA from five tissues (shoots, leaves, stems, flowers, and fruits) was combined in equal amounts to acquire a wide coverage of full-length transcripts in Z. bungeanum for single-molecule long-read sequencing.Three cDNA libraries of different sizes (1−2 kb, 2−3 kb, and >3 kb) were constructed and sequenced using a PacBio RSII sequencing system.Five SMRT cells generated 113,885, 102,613, and 65,603 ROIs from the three libraries.A total of 282,101 ROIs were generated, and >47% (134,400) were full-length ROIs, based on the presence of five primers, three primers, and poly(A) tails.The average length of the ROIs was 2819 bp (Table 1 and Figure 2).

Acquisition of High-Quality Sequences and Error Correction of Long-Reads Using Illumina Data
In order to generate high-quality transcripts, an ICE-based algorithm was used to iteratively cluster sequences.After clustering similar sequences, consensus isoforms were obtained for each cluster.In combination with non-full-length sequences, we used the quiver program to correct consensus sequences for each cluster, and finally obtained high-quality transcripts with an accuracy rate of >99%.We eventually obtained 75,973 consensus isoform sequences with 58,445 high-quality transcript isoforms (Table 2).In order to generate high-quality transcripts, an ICE-based algorithm was used to iteratively cluster sequences.After clustering similar sequences, consensus isoforms were obtained for each cluster.In combination with non-full-length sequences, we used the quiver program to correct consensus sequences for each cluster, and finally obtained high-quality transcripts with an accuracy rate of >99%.We eventually obtained 75,973 consensus isoform sequences with 58,445 high-quality transcript isoforms (Table 2).In parallel, 23,869,770 clean reads were sequenced using an Illumina platform and subsequently used for correcting low-quality consensus isoforms from the PacBio data.A total of 68,710 transcriptional sequences were obtained after the removal of redundant transcripts, which were suitable for further structural and functional analysis.

Predictions of ORFs, SSRs, and lncRNAs
A total of 65,711 ORFs were predicted, 52,331 of which were complete CDSs.The number and length distribution of proteins encoded by CDS regions are shown in Figure 3.   SSR markers can serve as useful tools for the marker-assisted breeding of numerous organisms [29].Herein, 50,135 SSRs and 30,745 SSR-containing sequences were detected across 68,698 transcripts (>500 bp) from Z. bungeanum.Of these, 11,707 transcripts contained more than one SSR, and 6273 were compound SSRs.The number of SSRs per transcript ranged from 1 to 14. Mononucleotide repeat transcripts were the most frequent type (25,123, 57.28%) with 10549 repeats, followed by 6858 (15.64%) tri-type transcripts with 533 repeats, 5611 (12.79%) di-type transcripts with 666 repeats, and 514 (1.17%) tetra-type transcripts with 520 repeats.Fewer hexa-type transcripts (334, 0.76%) and penta-type transcripts (194, 0.44%) were observed, with 512 and 58 repeats, respectively (Figure 4).Four computational approaches (CPC, CNCI, CPAT, and Pfam) were combined to distinguish non-protein-coding RNA candidates from putative protein-coding RNAs among the unknown transcripts.From the four analysis approaches, 3261, 3054, 10,520, and 9877 transcripts with lengths >200 nt and more than two exons were respectively selected as lncRNA candidates.Together, 1492 lncRNA transcripts were predicted for subsequent lncRNA analysis (Figure 5).transcripts.From the four analysis approaches, 3261, 3054, 10,520, and 9877 transcripts with lengths >200 nt and more than two exons were respectively selected as lncRNA candidates.Together, 1492 lncRNA transcripts were predicted for subsequent lncRNA analysis (Figure 5).

Functional Annotation of Transcripts
Transcript function was annotated using eight databases (NR, EggNOG, Pfam, Swiss-Prot, GO, KOG, KEGG, and COG).Of 68,710 transcripts, 66,823 (97.25%) matched at least one of the above

Gene Ontology (GO) Annotation
Using all ORF-containing transcripts for functional annotation, 4796 gene ontology (GO) terms were assigned to 45,982 PacBio transcript isoforms.These GO terms were classified into three main biological categories (cellular component, molecular function, and biological process).Biological process (2403, 50.10%) and molecular function (1932, 40.28%) represented the largest number of transcript isoforms, followed by cellular component (461, 9.61%).

EggNOG Annotation
EggNOG is a database of orthologous proteins and functional annotations that extends the non-supervised homology group based on the COG database [30].After searching transcript isoforms against the EggNOG database, 68,999 with functional information were annotated and divided into 25 categories.Among these, the "general function prediction only" group accounted for the largest number of transcripts (10,883, 15.77%), followed by signal transduction mechanisms (6797, 9.85%), posttranslational modification, protein turnover, and chaperones (6081, 8.81%), transcription (5257, 7.62%), carbohydrate transport and metabolism (3456, 5.01%), and intracellular

EggNOG Annotation
EggNOG is a database of orthologous proteins and functional annotations that extends the non-supervised homology group based on the COG database [30].After searching transcript isoforms against the EggNOG database, 68,999 with functional information were annotated and divided into 25 categories.Among these, the "general function prediction only" group accounted for the largest number of transcripts (10,883, 15.77%), followed by signal transduction mechanisms (6797, 9.85%), posttranslational modification, protein turnover, and chaperones (6081, 8.81%), transcription (5257, 7.62%), carbohydrate transport and metabolism (3456, 5.01%), and intracellular trafficking, secretion, and vesicular transport (2606, 3.78%).Only two transcripts were annotated in the cell motility category (Figure 8).

Analysis of KEGG Pathways and Gene Annotation Information
A total of 30,947 transcripts from Z. bungeanum were annotated to 126 reference metabolic pathways in the KEGG database.Pathways with more annotated transcripts were mainly related to the normal growth of plants, namely carbon metabolism (1471), biosynthesis of amino acids (1212), protein processing in the endoplasmic reticulum (1015), plant hormone signal transduction (953),

Analysis of KEGG Pathways and Gene Annotation Information
A total of 30,947 transcripts from Z. bungeanum were annotated to 126 reference metabolic pathways in the KEGG database.Pathways with more annotated transcripts were mainly related to the normal growth of plants, namely carbon metabolism (1471), biosynthesis of amino acids (1212), protein processing in the endoplasmic reticulum (1015), plant hormone signal transduction (953), RNA transport (900), spliceosome (896), starch and sucrose metabolism (823), mRNA surveillance pathway (773), and oxidative phosphorylation (737; Supplementary Figure S1).
Functional annotation revealed 30 cancer-related transcripts, with 18 transcripts annotated in the Pfam database, 11 transcripts in the SwissProt 10 transcripts in the KEGG database, and one in the NR database.These transcripts were colon cancer-associated protein Mic1-like and breast cancer susceptibility 1 and 2 (BRCA1 and BRCA2) homologs (Supplementary Table S1).

Evaluation of SMRT Sequencing Quality
In this study, we performed high-quality, full-length transcript isoform sequencing using a combination of SMRT sequencing and NGS technology to investigate gene information in Z. bungeanum.In total, 282,101 ROIs including 134,074 full-length non-chimeric reads were produced, and 75,973 consensus isoforms with 58,445 high-quality transcript isoforms were identified using ICE analysis of full-length non-chimeric reads.After error correction using reads sequenced with the Illumina platform, and the removal of redundant transcripts by way of CD-HIT software, 68,710 non-redundant transcript isoforms were obtained.
The high quality of the SMRT data was confirmed based on the comprehensive sequencing results.First, the average transcript lengths in the 1−2 kb, 2−3 kb, and 3−6 kb libraries were 2000, 2891, and 3567 bp, respectively, and the long total average length was 2819 bp, which indicates that ROIs were long enough to represent full-length transcripts [31].Second, the Illumina data were used to correct low-quality SMRT reads, which can reduce the error rate of SMRT sequencing [32].Thus, the resulting comprehensive transcriptome database for Z. bungeanum provided insight into the structures and functions of genes.

Application of lncRNA and NR Annotation
In this work, 1,492 lncRNAs were predicted in Z. bungeanum using all four computational approaches.Unexpectedly, the predicted lncRNAs were exceptionally long, with an average length of 2.5 kb (range = 0.3−10.0kb).LncRNAs play an important role in the physiology and development of plants, especially in some key biological processes.However, only a fraction of the functions of lncRNAs have been discovered [33,34].This almost untouched gene pool could include genes associated with agronomically relevant traits related to unresolved issues.
For example, the PN-LNC-N13 lncRNA discovered in Paspalum notatum Flugge is only expressed in apomictic plants, and there may be structural differences between apomixis and sexual types [34].Studies have confirmed that lncRNAs participate in abiotic stress responses and act as regulatory factors [35,36].Cui et al. [37] found that an lncRNA participates in responses to Phytophthora infestans (Montagne) de Bary in tomato, further suggesting that lncRNAs may also play a role in enhancing biological resistance.The above research suggests that lncRNA data can provide useful gene resource information of potential medicinal value, as well as apomixis and stress tolerance in Z. bungeanum.In addition, lncRNAs are species-and tissue-specific, and can provide inspiration for studying the characteristic traits of Z. bungeanum, such as numb taste components [38,39].
NR analysis revealed that Z. bungeanum is highly homologous to C. sinensis and C. clementina, and all belong to the family Rutaceae.Since the study of Z. bungeanum is still in its infancy, research on many problems must draw on knowledge of other plants.Fortunately, research on citrus plants is more in-depth, and sources for apomictic reproduction and stress resistance have been developed [40,41].

Excavation of KEGG Annotation Pathways Gene Annotation Information in Z. bungeanum
A large number of transcripts from Z. bungeanum were associated with metabolic pathways, indicating that the growth and development of Zanthoxylum requires varied metabolic support.This also shows that there are multiple functional metabolites in Z. bungeanum, many of which may be of medicinal value.Although some pathways were associated with fewer transcripts, they may still be worth noting.
Aroma is an important quality attribute of Zanthoxylum products.Volatile oils are mainly responsible for aroma in Zanthoxylum and the main components are terpenoids [42].In the KEGG annotation results, we identified the limonene and pinene degradation pathway (ko00903; Supplementary Figure S2A), the arachidonic acid metabolism pathway (ko00590; Supplementary Figure S2B), the degradation of aromatic compounds (ko01220; Supplementary Figure S2C), the sesquiterpenoid and triterpenoid biosynthesis pathway (ko00909; Supplementary Figure S2D), the monoterpenoid biosynthesis (ko00902; Supplementary Figure S2E), and the diterpenoid biosynthesis (ko00904; Supplementary Figure S2F).These seven pathways are most likely related to the metabolism of aroma components in the pericarp of Zanthoxylum, which provides an effective resource for future studies in this area.
Another decisive factor affecting the quality of Zanthoxylum is the color of the fruit.The color of ripe Zanthoxylulm fruit varies in different varieties.Interestingly, five of the annotated transcripts were associated with the anthocyanin biosynthesis pathway (ko00942; Supplementary Figure S2G).Thus, studies on the anthocyanin metabolic pathway could improve the color of Zanthoxylum fruit, benefitting production.
Human disease genes have been reported in plants, where they may perform different functions.For example, the BRCA1 homolog in Arabidopsis has a most conserved region PHD (plant homeodomain), which is absent in mammals [43].Numerous genes that participate in genome stability and cancer predisposition in animals are well conserved in plants [44].Herein, functional annotation revealed the presence of 30 cancer-related transcripts in Z. bungeanum.A study has shown that the BRCA2-RAD51 (recombination protein) complex plays a transcriptional regulatory role in plant immune response [45].This provides another idea for us to study their roles in genome stability and transcriptional regulation of plant growth in Zanthoxylum resources.

Conclusions
In the present study, high-quality Full-length non-chimeric transcripts, ORFs, SSRs, and lncRNAs were detected in Z. bungeanum by SMRT sequencing.Metabolic pathways related to the aroma and color of Z. bungeanum pericarp were discovered through gene functional annotation.Unexpectedly, tumor suppressors involved in the pathogenesis of human breast and colorectal cancers were also annotated in Z. bungeanum.Our work provides comprehensive genetic resources for further molecular research on the cultivation, genetics, and quality characteristics of Z. bungeanum.Meanwhile, the discovery of anti-oncogene information corroborates the potential medicinal value of Z. bungeanum.
Author Contributions: A.W. and J.T. conceived and designed the experiments.S.F., Y.L., and T.Y. aided in study design.J.T. performed the experiments, analyzed the results, and wrote the manuscript.L.Z., L.T., and Y.H. performed some of the experiments.All authors reviewed and approved the manuscript.

Figure 1 .
Figure 1.Zanthoxylum bungeanum Maxim.samples collected from the shoots, leaves, stems, flowers, and fruits of a 6-year-old individual in this study.

Figure 1 .
Figure 1.Zanthoxylum bungeanum Maxim.samples collected from the shoots, leaves, stems, flowers, and fruits of a 6-year-old individual in this study.

Forests 2018, 9 ,Figure 2 .
Figure 2. Length distribution of 282,101 reads of inserts (ROIs) from Z. bungeanum; (a) number and length distribution of ROIs; (b) histogram of ROI read length frequency distribution.

Figure 2 .
Figure 2. Length distribution of 282,101 reads of inserts (ROIs) from Z. bungeanum; (a) number and length distribution of ROIs; (b) histogram of ROI read length frequency distribution.

Forests 13 Figure 3 .Figure 3 .
Figure 3. Distribution of 52,331 complete coding sequences (CDSs) of complete open reading frames (ORFs) from Z. bungeanum.SSR markers can serve as useful tools for the marker-assisted breeding of numerous organisms [29].Herein, 50,135 SSRs and 30,745 SSR-containing sequences were detected across 68,698 transcripts (>500 bp) from Z. bungeanum.Of these, 11,707 transcripts contained more than one SSR, and 6273 were compound SSRs.The number of SSRs per transcript ranged from 1 to 14. Mononucleotide repeat transcripts were the most frequent type (25,123, 57.28%) with 10549 repeats,Figure 3. Distribution of 52,331 complete coding sequences (CDSs) of complete open reading frames (ORFs) from Z. bungeanum.

Figure 6 .
Figure 6.Species most closely related to Z. bungeanum in the NCBI database of non-redundant protein sequences database.

Figure 7 .
Figure 7. Gene Ontology (GO) classification of annotated transcripts from Z. bungeanum.GO terms were classified into three main categories: cellular component, molecular function, and biological process.The bottom x-axis indicates the number of transcripts, and the top x-axis indicates the percentage of transcripts.

Figure 7 .
Figure 7. Gene Ontology (GO) classification of annotated transcripts from Z. bungeanum.GO terms were classified into three main categories: cellular component, molecular function, and biological process.The bottom x-axis indicates the number of transcripts, and the top x-axis indicates the percentage of transcripts.

Figure 8 .
Figure 8. Evolutionary genealogy of genes based on Non-supervised Orthologous Groups (EggNOG) annotation of Z. bungeanum transcripts.The x-axis indicates EggNOG categories, and the y-axis indicates the number of transcripts.

Figure 8 .
Figure 8. Evolutionary genealogy of genes based on Non-supervised Orthologous Groups (EggNOG) annotation of Z. bungeanum transcripts.The x-axis indicates EggNOG categories, and the y-axis indicates the number of transcripts.

Funding:
This work was supported by the National Key Research and Development Program of China (2016YFC0501706).

Table 2 .
Results of iterative clustering for error correction in Z. bungeanum.