Single-Molecule Real-Time Sequencing of the Madhuca pasquieri (Dubard) Lam. Transcriptome Reveals the Diversity of Full-Length Transcripts

: Madhuca pasquieri (Dubard) Lam. is a tree on the International Union for Conservation of Nature Red List and a national key protected wild plant (II) of China, known for its seed oil and timber. However, lacking of genomic and transcriptome data for this species hampers study of its reproduction, utilization, and conservation. Here, single-molecule long-read sequencing (PacBio) and next-generation sequencing (Illumina) were combined to obtain the transcriptome from ﬁve developmental stages of M. pasquieri . Overall, 25,339 transcript isoforms were detected by PacBio, including 24,492 coding sequences (CDSs), 9440 simple sequence repeats (SSRs), 149 long non-coding RNAs (lncRNAs), and 182 alternative splicing (AS) events, a majority was retained intron (RI). A further 1058 transcripts were identiﬁed as transcriptional factors (TFs) from 51 TF families. PacBio recovered more full-length transcript isoforms with a longer length, and a higher expression level, whereas larger number of transcripts (124,405) was captured in de novo from Illumina. Using Nr, Swissprot, KOG, and KEGG databases, 24,405 transcripts (96.31%) were annotated by PacBio. Functional annotation revealed a role for the auxin, abscisic acid, gibberellin, and cytokinine metabolic pathways in seed germination and post-germination. These ﬁndings support further studies on seed germination mechanism and genome of M. pasquieri , and better protection of this endangered species. biosynthesis biosynthesis, biosynthesis, biosynthesis, biosynthesis,

Forests 2020, 11, 866 3 of 23 analysis by SMRT. The PacBio Sequel platform has been used to generate comprehensive full-length transcriptome of M. pasquieri, and combined with Illumina platform to obtain a more complete transcriptome. In this study, Illumina RNA-Seq was used to correct short-read errors on SMRT transcripts obtained from PacBio, allowing differences to be compared between the two platforms. Then, we functionally annotated the full-length transcriptomes. Isoform analysis revealed the complexity of AS in M. pasquieri, and lncRNAs were also identified. Thus, we systematically characterized the complexity of the M. pasquieri transcriptome, as well as its structure and functional annotation. This in-depth characterization will provide a valuable tool for understanding the seed germination and growth mechanism of M. pasquieri and for future conservation purposes. Furthermore, this transcriptome provides basic data and important references for future studies on functional gene mining and utilization, genetic resource classification and evolution, and molecular marker development to promote the efficient and sustainable exploitation of this precious biological resource.

Plant Materials
M. pasquieri was grown in an artificial climate chamber, with a light cycle of 14 h/10 h (day/night), 17,600 lx, temperature 25 • C, and humidity of 60%-80%, at South China Agricultural University in China. During seed germination and post-germination growth, M. pasquieri plants were selected based on five developmental stages from the same batch of light matrix culture in the artificial climate chamber (seed germination, hypocotyl elongation, epicotyl elongation, two-leaf, and nine-leaf stages; Figure 1), with three biological replicates per stage. Collected samples were snap frozen in liquid nitrogen and stored at −80 • C until use.

Library Construction and SMRT Sequencing
Total M. pasquieri plants from five developmental stages, with three biological replicates per stage, were pooled. Total RNA was extracted by grinding tissue in TRIzol reagent (Life Technologies, Carlsbad, CA, USA) on dry ice and processed following the manufacturer's protocol. RNA integrity was determined using the Agilent 2100 Bioanalyzer and agarose gel electrophoresis. RNA purity and concentration were determined via a Nanodrop micro-spectrophotometer (Thermo Fisher Scientific, Wilmington, DE, USA). mRNA was enriched by Oligo (dT) magnetic beads, and then reverse-transcribed into cDNA using the Clontech SMARTer PCR cDNA Synthesis Kit (Clontech Laboratories, CA, USA). The PCR program was optimized to determine the optimal number of amplification cycles for the downstream large-scale PCR. Then, the optimized cycle number was used to generate double-stranded cDNA. In addition, cDNA of >4 kb was selected using the BluePippin TM Size Selection System (Sage Science, Beverly, MA, USA) and mixed equally with the no-size-selection cDNA. Large-scale PCR was also performed to construct the next SMRTbell library. cDNA underwent DNA-damage repair, end-repair, and was ligated to sequencing adapters. The SMRTbell template was annealed to a sequencing primer, bound to a polymerase, and sequenced on the PacBio Sequel platform using P6-C6 chemistry with 10 h movies.

Analysis of SMRT Sequencing Data
The raw sequencing reads of cDNA libraries were classified and clustered into a transcript consensus using the SMRT Link v5.0.1 pipeline supported by Pacific Biosciences. CCS reads were extracted from subreads BAM files and then were classified as FL non-chimeric, non-full-length (nFL), chimeras, or short reads based on cDNA primers and the polyA tail signal. Short reads were discarded. Subsequently, the full-length non-chimeric (FLNC) reads were clustered by Iterative Clustering for Error Correction (ICE) software to generate the cluster consensus isoforms. To improve the accuracy of PacBio reads, two strategies were employed: (1) the nFL reads were used to polish the obtained cluster consensus isoforms by Quiver software to attain FL polished high-quality consensus sequences (accuracy ≥ 99%). (2) The LoRDEC tool (version 0.8) was used to further correct the low-quality isoforms using Illumina short reads obtained from the same samples. Then, the final transcriptome isoform sequences were filtered by removing redundant sequences using the software CD-HIT-v4.6.7 with a threshold of 0.99 identities (Figure 1).
Forests 2020, 11, x FOR PEER REVIEW 4 of 25 Figure 1. Images of Madhuca pasquieri (Dubard) Lam., which was used for sequencing, and the workflows used in this study.

Library Construction and SMRT Sequencing
Total M. pasquieri plants from five developmental stages, with three biological replicates per stage, were pooled. Total RNA was extracted by grinding tissue in TRIzol reagent (Life Technologies, Carlsbad, CA, USA) on dry ice and processed following the manufacturer's protocol. RNA integrity was determined using the Agilent 2100 Bioanalyzer and agarose gel electrophoresis. RNA purity and concentration were determined via a Nanodrop micro-spectrophotometer (Thermo Fisher Scientific, Wilmington, DE, USA). mRNA was enriched by Oligo (dT) magnetic beads, and then reverse-transcribed into cDNA using the Clontech SMARTer PCR cDNA Synthesis Kit (Clontech Laboratories, CA, USA). The PCR program was optimized to determine the optimal number of amplification cycles for the downstream large-scale PCR. Then, the optimized cycle number was used to generate double-stranded cDNA. In addition, cDNA of >4 kb was selected using the BluePippin TM Size Selection System (Sage Science, Beverly, MA, USA) and mixed equally with the no-size-selection cDNA. Large-scale PCR was also performed to construct the next SMRTbell library. cDNA underwent DNA-damage repair, end-repair, and was ligated to sequencing adapters. The SMRTbell template was annealed to a sequencing primer, bound to a polymerase, and

Illumina RNA Sequencing and De Novo Assembly of Short Reads
M. pasquieri plants sampled at five developmental stages, with three biological replicates per stage, were each used for Illumina RNA sequencing. After total RNA was extracted, eukaryotic mRNA with a polyA tail was enriched by Oligo (dT) beads, and then the enriched mRNA was fragmented into short fragments by ultrasonic waves and reverse-transcribed into cDNA using random primers. Second-strand cDNA was synthesized by DNA polymerase I, RNase H, dNTP, and buffer (New England Biolabs, Ipswich, MA, USA). Next, the cDNA fragments were purified using a QiaQuick PCR extraction kit (Qiagen, Düsseldorf, GER) end-repaired, the polyA was added, and the fragments were then ligated to Illumina sequencing adapters. The ligation products were size-selected by agarose gel electrophoresis, amplified by PCR, and sequenced using Illumina HiSeq TM Reads obtained from the sequencing machines included raw reads containing adapters or low-quality bases, which affect subsequent assembly and analysis. Thus, high-quality clean reads were obtained by further filtering according to the following rules: (1) removal of reads containing adapters; (2) removal of reads containing more than 10% of unknown nucleotides (N); (3) removal of reads containing all A bases; (4) removal of low-quality reads containing more than 50% low-quality (Q-value ≤ 20) bases. After filtering the data, base composition and mass distribution were analyzed to visualize data quality. The more balanced the base composition, the higher the quality, and the more accurate the subsequent analysis will be. Then, Trinity v2.8.4 software was used to assemble reads (Figure 1), and the quality of the assembly could be evaluated from the N50 value.

Evaluation of Sequencing Results
The protein sequences predicted from two sequencing results were analyzed using BUSCO v3 i to determine the completeness of the conserved content in the transcriptome. The percentage of transcripts that fully aligned (≥70%) and partially aligned to the conserved proteins, as well as the percentage missing proteins were determined and compared.

Prediction of Coding Sequences (CDSs), Simple Sequence Repeats (SSRs), and Transcription Factors (TFs)
Open reading frames (ORFs) in the isoform sequences were detected using ANGEL software in order to determine the CDSs, protein sequences, and untranslated region (UTR) sequences.
SSR prediction was analyzed using the MISA (version 1) software (http://pgrc.ipk-gatersleben.de/ misa/) 64 with default parameters in the whole transcriptome. Based on the MISA results, Primer 1.1.4 was used to design primer pairs specific for the flanking regions of SSRs for subsequent validation.

Characterization of AS Events
To analyze AS events of transcript isoforms, the COding GENome reconstruction Tool (Cogent) was first used to partition transcripts into gene families based on k-mer similarity, and to reconstruct each family into a coding reference genome based on De Bruijn graph methods. Then, the SUPPA tool was used to analyze AS events of transcript isoforms. Five major types of AS events, namely A3 (alternative 3 splice sites), A5 (alternative 5 splice sites), AF (alternative first exon), RI (retained intron), and SE (skipping exons), were extracted from the output files and counted.

LncRNA Identification from PacBio Sequences
CNCI (version 2), CPAT, CPC (version 1), and Pfam were used to assess the protein-coding potential of transcripts without annotations by default parameters for potential lncRNAs. To better annotate lncRNAs on an evolutionary level, the software Infernal (http://eddylab.org/infernal/) was used for sequence alignment. LncRNAs were classified based on their secondary structures and sequence conservation.

Functional Annotation
Corrected isoforms were analyzed by BLAST against the NCBI non-redundant protein (Nr) database (http://www.ncbi.nlm.nih.gov), the Swiss-Prot protein database (http://www.expasy.ch/sprot), the Kyoto Encyclopedia of Genes and Genomes (KEGG) database (http://www.genome.jp/kegg), and the COG/KOG database (http://www.ncbi.nlm.nih.gov/COG) using the BLASTx program (http: //www.ncbi.nlm.nih.gov/BLAST/) at an E-value threshold of 1 × 10 −5 to evaluate sequence similarity with genes of other species. Gene Ontology (GO) annotation was analyzed by Blast2GO software with Nr annotation results of isoforms. Isoforms with the top 20 highest scores, and no shorter than 33 high-scoring segment pair (HSP) hits were selected for Blast2GO analysis. Then, isoforms were functionally classified using WEGO software.

General Properties of Single-Molecule Long-Reads
In order to obtain M. pasquieri transcripts that were as complete as possible, high-quality total RNA was extracted from each pooled sample representing the five different developmental stages. Because PacBio Sequel does not screen fragments, a full library of samples was built. After filtering, 22,704,140 subreads were obtained, with a mean length of 1193 bp and a N50 of 1529 bp. A total of 438,795 CCSs with an average depth of nine passes were generated from subreads after merging and correcting errors by multiple sequencing. The length distribution of CCSs was consistent with the expected size ( Figure 2a). Furthermore, ICE and Quiver algorithms were used to obtain 29,003 high-quality sequences and 85 low-quality sequences. The length distribution of consensus isoforms is shown in Figure 2b.

Acquisition of High-Quality Sequences and Error Correction of Long Reads Using Illumina Data
The basic error rate of the SMRT sequences was 12-15%, mainly due to the insertion of extra bases. Low-quality sequences obtained on the PacBio Sequel platform were corrected using Illumina RNA-Seq transcripts and LoRDEC (version 0.8). After polishing, low-quality sequences with polish coverage (percentage of bases corrected by the second-generation data in the third-generation consistent sequence) of more than 99% were combined with the high-quality sequences obtained by Quiver polish. Finally, 29,042 sequences were obtained, with a mean length of 1438.11 bp, a N50 of 1645 bp, and GC content of 44.59% (Table S1). Then, cd-hit-v4.6.7 software was used to remove redundant sequences from the high-quality consistent sequences in the library. Local alignment was adopted, where the alignment rate was 99% for shorter sequences and the number of bases unaligned was less than 30 bp. For longer sequences, the alignment rate was 90% and the number of bases unaligned was less than 100 bp. The final set of PacBio transcript isoforms contained 25,339 sequences, with a mean length of 1436.77 bp, a N50 of 1652 bp, and GC content of 44.39% (Table S1); the length distribution of these isoforms is shown in Figure 2c. Overall, correcting errors improved transcript prediction, with more transcripts covering the full-length of known proteins, and a longer N50, which is suitable for further structural and functional analysis. To assess the completeness of our transcriptome, BUSCO was used to evaluate the sequencing results and showed that 26.81% were complete and single-copy BUSCOs, 12.22% were complete and duplicated BUSCOs, 3.54% were fragmented BUSCOs, and 57.43% were missing BUSCOs ( Figure S1).

Comparison of PacBio and Illumina Transcripts and Sequencing Depth
De novo assembly has been used widely to construct transcriptomes without any reference

Acquisition of High-Quality Sequences and Error Correction of Long Reads Using Illumina Data
The basic error rate of the SMRT sequences was 12-15%, mainly due to the insertion of extra bases. Low-quality sequences obtained on the PacBio Sequel platform were corrected using Illumina RNA-Seq transcripts and LoRDEC (version 0.8). After polishing, low-quality sequences with polish coverage (percentage of bases corrected by the second-generation data in the third-generation consistent sequence) of more than 99% were combined with the high-quality sequences obtained by Quiver polish. Finally, 29,042 sequences were obtained, with a mean length of 1438.11 bp, a N50 of 1645 bp, and GC content of 44.59% (Table S1). Then, cd-hit-v4.6.7 software was used to remove redundant sequences from the high-quality consistent sequences in the library. Local alignment was adopted, where the alignment rate was 99% for shorter sequences and the number of bases unaligned was less than 30 bp. For longer sequences, the alignment rate was 90% and the number of bases unaligned was less than 100 bp. The final set of PacBio transcript isoforms contained 25,339 sequences, with a mean length of 1436.77 bp, a N50 of 1652 bp, and GC content of 44.39% (Table S1); the length distribution of these isoforms is shown in Figure 2c. Overall, correcting errors improved transcript prediction, with more transcripts covering the full-length of known proteins, and a longer N50, which is suitable for further structural and functional analysis. To assess the completeness of our transcriptome, BUSCO was used to evaluate the sequencing results and showed that 26.81% were complete and single-copy BUSCOs, 12.22% were complete and duplicated BUSCOs, 3.54% were fragmented BUSCOs, and 57.43% were missing BUSCOs ( Figure S1).

Comparison of PacBio and Illumina Transcripts and Sequencing Depth
De novo assembly has been used widely to construct transcriptomes without any reference sequence; therefore, the M. pasquieri transcriptome was assembled from Illumina short-reads to provide a comparative reference for the isoform transcript sequences obtained from PacBio Sequel. In this study, 15 samples were tested, generating an average of 45,627,996 raw reads. Then, fastp 0.18.0 was used to filter raw data and obtain clean reads (each sample > 99.5%, Table S2). After filtering, the base composition and mass distribution was analyzed to visualize data quality. The results showed that the Q20 and Q30 of each sample were both >90% and the GC content of each sample was >47% (Table S3), indicating the quality of data sequencing. Trinity v2.8.4 was used to assemble reads, resulting in 124,405 unigenes, with a mean length of 834 bp, a N50 of 1387 bp, and a GC content of 44.89% (Table S1). The length distribution of assembled unigenes is shown in Figure S2. BUSCO was then used to evaluate the sequencing results; there were 1035 (71.88%) complete and single-copy BUSCOs, 108 (7.5%) complete and duplicated BUSCOs, 170 (11.81%) fragmented BUSCOs, and 127 (8.82%) missing BUSCOs ( Figure S1).
In conclusion, these results indicated that although the SMRT sequencing depth was less than that of the Illumina platform, SMRT significantly improved the length and expression level of transcripts.

Prediction of CDSs, SSRs, and TFs
CDS is a sequence of protein products that correspond exactly to a protein codon. A total of 24,492 CDSs were predicted by PacBio Sequel, and the number and length distribution of proteins encoded by CDS regions are shown in Figure 4. Additionally, 65,297 CDSs were identified based on Illumina data; however, the mean length was less than that predicted by PacBio ( Figure S4). SSR markers can serve as useful tools for genetic diversity analysis, genetic linkage, evolutionary studies, and marker-assisted breeding in many species, especially endangered species, due to their abundance, highly polymorphic nature, co-dominant inheritance, and random distribution throughout the genome [27][28][29][30]. In this study, 9400 SSRs and 7819 SSR-containing sequences were detected across 25,339 transcripts from M. pasquieri. Of these, 1363 transcripts contained more than one SSR, and 1033 contained compound SSRs. Di-nucleotide repeat transcripts were the most frequent type (5269, 67.39%) with six to 30 repeats, followed by 1718 (21.97%) tri-nucleotide repeats transcripts with five to 24 repeats, 369 (4.72%) tetra-nucleotide repeats transcripts with four to eight repeats, 162 (2.07%) penta-nucleotide repeats, and 141 (1.80%) hexa-nucleotide repeats both with four to seven repeats ( Figure 5a). Among the di-, tri-, and tetra-nucleotide repeats, the motifs were AC/GT, AAC/GTT, and AAAT/TTTA, respectively. Detailed information is shown in Figure 5b. A total of 18,070 SSRs and 15,444 SSR-containing sequences were detected across 124,405 unigenes. Di-nucleotide repeat unigenes were also the most frequent type (11,633), followed by 5105 tri-nucleotide repeats unigenes, 1081 tetra-nucleotide repeats unigenes, 451 penta-nucleotide repeats unigenes, and 326 hexa-nucleotide repeats unigenes (Table S4). In the di-, tri-, tetra-and penta-nucleotide repeats, the motif was AG/CT, AAG/CTT, AAAT/ATTT, and AAACC/GGTTT, respectively ( Figure S5).
In conclusion, these results indicated that although the SMRT sequencing depth was less than that of the Illumina platform, SMRT significantly improved the length and expression level of transcripts.  CDS is a sequence of protein products that correspond exactly to a protein codon. A total of 24,492 CDSs were predicted by PacBio Sequel, and the number and length distribution of proteins encoded by CDS regions are shown in Figure 4. Additionally, 65,297 CDSs were identified based on Illumina data; however, the mean length was less than that predicted by PacBio ( Figure S4). SSR markers can serve as useful tools for genetic diversity analysis, genetic linkage, evolutionary studies, and marker-assisted breeding in many species, especially endangered species, due to their abundance, highly polymorphic nature, co-dominant inheritance, and random distribution throughout the genome [27][28][29][30]. In this study, 9400 SSRs and 7819 SSR-containing sequences were detected across 25,339 transcripts from M. pasquieri. Of these, 1363 transcripts contained more than one SSR, and 1033 contained compound SSRs. Di-nucleotide repeat transcripts were the most frequent type (5269, 67.39%) with six to 30 repeats, followed by 1718 (21.97%) tri-nucleotide repeats transcripts with five to 24 repeats, 369 (4.72%) tetra-nucleotide repeats transcripts with four to eight repeats, 162 (2.07%) penta-nucleotide repeats, and 141 (1.80%) hexa-nucleotide repeats both with four to seven repeats (Figure 5a). Among the di-, tri-, and tetra-nucleotide repeats, the motifs were AC/GT, AAC/GTT, and AAAT/TTTA, respectively. Detailed information is shown in Figure 5b. A total of 18,070 SSRs and 15,444 SSR-containing sequences were detected across 124,405 unigenes. Di-nucleotide repeat unigenes were also the most frequent type (11,633), followed by 5105 tri-nucleotide repeats unigenes, 1081 tetra-nucleotide repeats unigenes, 451 penta-nucleotide repeats unigenes, and 326 hexa-nucleotide repeats unigenes (Table S4). In the di-, tri-, tetra-and penta-nucleotide repeats, the motif was AG/CT, AAG/CTT, AAAT/ATTT, and AAACC/GGTTT, respectively ( Figure S5). TFs play important roles in the regulation of plant growth and development [31]. We compared the predicted protein sequences with the corresponding TF database (plant TFdb/animal TFdb) for hmmscan. A total of 1058 transcripts were identified as TFs and classified into 51

AS Events Detected from PacBio Sequel
Using the results obtained from PacBio transcript isoforms, 182 AS events were identified, including 42 (23.08%) A3, 33 (18.13%) A5, 18 (9.89%) AF, 82 (45.05%) RI, and seven (3.85%) SE, among which RI was the main AS event (Figure 6a,b). The AS events in our study largely enriched the transcript information for M. pasquieri. Due to the lack of a genome database, splice isoforms of unannotated genes remain unknown. Results from the PacBio analysis indicated that only a single isoform was detected in 109 (2.44%) genes, and two or more isoforms were found in 4353 genes (97.55%) (Figure 6c). Ten; and more than ten splice isoforms were detected in 204 (4.57%) genes. For example, 16 different COGENT000951 isoforms were identified in this study and were predicted to be associated with metabolic pathways, biosynthesis of secondary metabolites, and phenylpropanoid biosynthesis; sequencing results are shown in Figure 6d (example of A3). Additionally, 11 different COGENT002109 isoforms were identified, and the results are shown in Figure 6e, which were predicted to be associated with plant hormone signal transduction (an example of RI).

LncRNA Detected from PacBio Sequel
Four computational approaches (CNCI, CPAT, CPC, and Pfam) were combined to predict lncRNAs from putative protein-coding RNAs among the unknown transcripts. From the four different analyses, 779, 264, 866, and 933 transcripts longer than 200 nt were selected as lncRNAs, among which 149 common lncRNAs were predicted for subsequent analysis (Figure 7a). Some of these lncRNAs were up to 4000nt long (Figure 7b).

Gene Ontology (GO) Annotation
GO analysis showed that 11,810 PacBio transcript isoforms (46.61% of total set) could be divided into three groups; biological processes, molecular functions, and cellular components. Transcripts in 'biological processes' were mainly enriched for metabolic process, cellular process, single-organism process, and others ( Figure 9). Transcripts involved in 'cellular components' consisted of cell, cell part, organelle, membrane, and membrane part. For the category 'molecular function', transcripts were mainly involved in catalytic activity, binding, and transporter activity. A comparison of enriched GO terms between the PacBio transcript isoforms and de novo transcript unigenes (which had 76,548 unigenes annotated, accounting for 61.53% of the total de novo set) is presented in Figure 9.

LncRNA Detected from PacBio Sequel
Four computational approaches (CNCI, CPAT, CPC, and Pfam) were combined to predict lncRNAs from putative protein-coding RNAs among the unknown transcripts. From the four different analyses, 779, 264, 866, and 933 transcripts longer than 200 nt were selected as lncRNAs, among which 149 common lncRNAs were predicted for subsequent analysis (Figure 7a). Some of

Functional Annotation of Transcripts
All 25,339 transcripts (corrected isoforms) were functionally annotated by searching Nr, Swissprot, KOG, and KEGG databases, and 24,405 transcripts (96.31%) were annotated in PacBio. Of these, 24,358 transcripts were annotated in the Nr database, 21,059 were annotated in the Swissprot database, 16,957 were annotated in the KOG database, and 13,185 were annotated in the KEGG database (Figures 8a and S3). A total of 934 transcripts did not return any matches and may reflect novel transcripts in the M. pasquieri transcriptome. Homologous species were analyzed by comparing the transcript sequences with those in the Nr database, and the results showed that the highest numbers of transcripts were found in Vitis vinifera (3484, 14

Analysis of KEGG Pathways and Gene Annotation Information
KEGG pathway analysis provided additional functional information relating to the pathways associated with each transcript isoform, since one gene could be assigned to more than one GO term in the Gene Ontology annotation. The KEGG results demonstrated that 13,185 PacBio transcript isoforms (52.03% of the total) from M. pasquieri were annotated to 132 KEGG pathways, while 55,975 de novo transcript unigenes (44.99% of the total) were annotated to 136 KEGG pathways ( Figure 10). The functional pathway was first assigned to five KEGG biochemical pathways, including cellular processes, environmental information processing, genetic information processing, metabolism, and organismal systems. 'Metabolism' represented the largest group in both PacBio and de novo transcript datasets, containing 102 and 105 pathways, respectively. With most associated with metabolic pathway (3795/7267), biosynthesis of secondary metabolites (2189/4069), biosynthesis of antibiotics (1229/0), microbial metabolism in diverse environments (1081/0), carbon metabolism (910/1504), and biosynthesis of amino acids (679/1236). Those pathways related to genetic information processing were the second largest group, including transcripts involved in protein processing in endoplasmic reticulum (715/971), ribosome (672/2321), spliceosome (509/780), and RNA transport (387/718). The third largest group comprised cellular processes, with a majority of transcripts involved in endocytosis (381/618) and phagosome (220/430). Plant hormone signal transduction (349/436) and plant-pathogen interaction (336/728) were the most in environmental information processing and organismal systems, respectively. In addition, some important pathways were also found in M. pasquieri, including carbon fixation in photosynthetic organisms, photosynthesis, phenylpropanoid biosynthesis, flavonoid biosynthesis, anthocyanin biosynthesis, isoflavonoid biosynthesis, flavone and flavonol biosynthesis, terpenoid backbone biosynthesis, sesquiterpenoid and triterpenoid biosynthesis, monoterpenoid biosynthesis, and diterpenoid biosynthesis (Table S5). These results provide a valuable resource for investigating metabolic pathways in M. pasquieri.

Gene Ontology (GO) Annotation
GO analysis showed that 11,810 PacBio transcript isoforms (46.61% of total set) could be divided into three groups; biological processes, molecular functions, and cellular components. Transcripts in 'biological processes' were mainly enriched for metabolic process, cellular process, single-organism process, and others ( Figure 9). Transcripts involved in 'cellular components' consisted of cell, cell part, organelle, membrane, and membrane part. For the category 'molecular function', transcripts were mainly involved in catalytic activity, binding, and transporter activity. A comparison of enriched GO terms between the PacBio transcript isoforms and de novo transcript unigenes (which had 76,548 unigenes annotated, accounting for 61.53% of the total de novo set) is presented in Figure 9.

Analysis of KEGG Pathways and Gene Annotation Information
KEGG pathway analysis provided additional functional information relating to the pathways associated with each transcript isoform, since one gene could be assigned to more than one GO term in the Gene Ontology annotation. The KEGG results demonstrated that 13,185 PacBio transcript isoforms (52.03% of the total) from M. pasquieri were annotated to 132 KEGG pathways,  Plant hormone signal transduction is important for regulating germination and growth, and 349 PacBio transcript isoforms have been shown to be involved in plant hormone signal transduction pathway (ko04075 ; Table S5). Auxin, gibberellin (GA), and cytokinine signal transduction pathways accelerate seed germination and plant development, while abscisic acid (ABA) signal transduction pathway plays the opposite role. In the auxin pathway, 59 transcripts were annotated as key genes, which encoded auxin transporter protein 1 (AUX1), transport inhibitor response 1 (TIR1), auxin/indole-3-acetic acid (AUX/IAA), auxin response factor (ARF), gretchen hagen 3 (GH3), and small auxin upregulated RNA (SAUR). In the GA pathway, 22 transcripts were annotated as four key genes, GA-insensitive dwarf mutant 1 (GID1), GA-insensitive dwarf mutant 2 (GID2), DELLA, and TF. In the cytokinine pathway, nine transcripts were annotated as four key genes, cytokinin response 1 (CRE1), arabidopsis histidine-containing phosphotransfer protein (AHP), type-B arabidopsis response regulators (B-ARR), and type-A arabidopsis response regulators (A-ARR). In the ABA pathway, 47 transcripts were annotated as pyrabactin resistance/PYR-like (PYR/PYL), Protein Phosphatase 2 C (PP2C), Sucrose non-fermenting 1-related protein kinases subfamily 2 (SnRK2), and ABRE-binding factor (ABF) (ko04075; Figure S6).

Comparison of PacBio Transcripts and De Novo Unigenes
The de novo transcriptome assembly of second-generation sequencing technology has been used widely for transcriptome analysis in species without a genomic reference. Large-scale sequencing of transcriptome data by second-generation sequencing cannot generate full-length sequences or alternatively spliced forms of RNA. With the emergence of SMRT sequencing technology, full-length transcripts could be obtained without large-scale assembly. For example, de novo assembly from short reads only reconstructed 8% of PacBio isoforms in maize (Zea mays) [32]. Moreover, compared with RNA-Seq data or previously annotated references, PacBio retrieved longer transcripts, including for Amborella trichopoda [33], avocado (Persea americana) [34], and Populus alba var. pyramidalis [35]. To date, no genomic or transcriptome information for M. pasquieri Plant hormone signal transduction is important for regulating germination and growth, and 349 PacBio transcript isoforms have been shown to be involved in plant hormone signal transduction pathway (ko04075 ; Table S5). Auxin, gibberellin (GA), and cytokinine signal transduction pathways accelerate seed germination and plant development, while abscisic acid (ABA) signal transduction pathway plays the opposite role. In the auxin pathway, 59 transcripts were annotated as key genes, which encoded auxin transporter protein 1 (AUX1), transport inhibitor response 1 (TIR1), auxin/indole-3-acetic acid (AUX/IAA), auxin response factor (ARF), gretchen hagen 3 (GH3), and small auxin upregulated RNA (SAUR). In the GA pathway, 22 transcripts were annotated as four key genes, GA-insensitive dwarf mutant 1 (GID1), GA-insensitive dwarf mutant 2 (GID2), DELLA, and TF. In the cytokinine pathway, nine transcripts were annotated as four key genes, cytokinin response 1 (CRE1), arabidopsis histidine-containing phosphotransfer protein (AHP), type-B arabidopsis response regulators (B-ARR), and type-A arabidopsis response regulators (A-ARR). In the ABA pathway, 47 transcripts were annotated as pyrabactin resistance/PYR-like (PYR/PYL), Protein Phosphatase 2 C (PP2C), Sucrose non-fermenting 1-related protein kinases subfamily 2 (SnRK2), and ABRE-binding factor (ABF) (ko04075; Figure S6).

Comparison of PacBio Transcripts and De Novo Unigenes
The de novo transcriptome assembly of second-generation sequencing technology has been used widely for transcriptome analysis in species without a genomic reference. Large-scale sequencing of transcriptome data by second-generation sequencing cannot generate full-length sequences or alternatively spliced forms of RNA. With the emergence of SMRT sequencing technology, full-length transcripts could be obtained without large-scale assembly. For example, de novo assembly from short reads only reconstructed 8% of PacBio isoforms in maize (Zea mays) [32]. Moreover, compared with RNA-Seq data or previously annotated references, PacBio retrieved longer transcripts, including for Amborella trichopoda [33], avocado (Persea americana) [34], and Populus alba var. pyramidalis [35]. To date, no genomic or transcriptome information for M. pasquieri has been reported. In this study, five developmental stages of M. pasquieri were sampled to obtain more comprehensive transcript information, and 438,795 CCSs were obtained by PacBio Sequel. Due to the high error rate associated with third-generation sequencing, Illumina RNA-Seq transcripts and LoRDEC (v 0.8) were used to correct the low-quality sequences. Finally, 25,339 full-length transcripts were obtained, with a mean length of 1436.77 bp and an N50 value of 1652 bp, which will benefit further studies on M. pasquieri. However, these transcripts were shorter than reported in previous transcriptome studies of alfalfa (Medicago sativa L.) (mean length = 2551 bp, N50 = 2928 bp) [36] and Gnetum luofuense (mean length = 3237 bp, N50 = 3629 bp) [37], using the same technology. This result may be related to the differences in the parameters and nature characteristics of the species.
In this study, de novo assembly from Illumina using the same experimental material generated more transcripts (124,405 unigenes) than PacBio Sequel; however, the mean length of the transcripts was 834 bp and the N50 was 1387 bp, which were substantially shorter than those obtained by PacBio, at 1436.77 bp and 1652 bp, respectively (Table S1). The results indicate that PacBio is better able to capture long transcript sequences, similar to those reported in adlay (Coix lacryma-jobi) [38]. Although de novo assembly resulted in a higher number of transcripts and annotated transcripts (66,026), the latter accounted for only 53.07% of the total transcripts, which was much lower than the 96.31% obtained by PacBio. Notably, the annotation rate in all databases was significantly higher with PacBio sequencing data compared with Illumina data; for example, Nr, 51 (Figure 3a and Figure S3). By comparing de novo and PacBio transcripts, 10,140 unigenes and 15,564 isoforms were found in common by BLASTN, accounting for 8.2 and 61.4% respectively. Additionally, 91.8% (114,265) transcripts were found specifically in de novo assembly, and 38.6% (9775) in PacBio (Figure 3b). Thus, although de novo assembly can obtain a large number of transcripts and annotated transcripts, this may account for the great depth of reads used for assembly [20], while a large number of unannotated transcripts may contain many new transcripts. Therefore, although Illumina provides more transcripts and greater sequencing depth compared with PacBio Sequel, the PacBio Sequel method can detect more full-length transcripts and more accurately annotated transcripts; this is more conducive to obtaining accurate transcript information for M. pasquieri.

Analysis of Alternative Splicing in Transcriptomes
AS of precursor mRNAs (pre-mRNAs) during eukaryotic gene transcription may increase the number of protein isoforms produced by the removal of introns and the joining of exons [39][40][41]. The splicing mode of multi-exon mRNA may vary in several ways, and is usually divided into SE, A5, A3, Mutually Exclusive Exon (MX), RI, AF, and Alternative Last Exon (AL), leading to multiple transcripts of some genes [42]. Therefore, AS markedly increases the complexity and flexibility of the transcriptome and proteome [43]. In addition, AS is involved in the regulation of growth, development, signal transduction, flowering, and responses to various abiotic stresses [44][45][46][47]. Although RNA-Seq can accurately quantify and annotate individual AS events, it is hard to deduce full-length splicing isoforms that contain a combination of these individual events [48,49]. SMRT sequencing enables the generation of full-length sequences and the identification of complex splice isoforms, which are difficult to detect and reconstruct by RNA-Seq [50]. For example, PacBio identified more AS events in strawberry (Fragaria vesca) (17,260) compared with Illumina (12,080) [42]. In cotton (Gossypium spp.), PacBio (133,229) retrieved eight-times more AS events than Illumina (16,437) [51]. In the present study, 182 AS events were identified by PacBio Sequel in M. pasquieri, which were classified into five types, including 42 A3, 33 A5, 18 AF, 82 RI, and seven SE (Figure 6b,c). The majority of AS events were RI (45.5%), similar to previous reports in other plant species, such as sorghum (Sorghum bicolor BTx623) [52], bread wheat (Triticum aestivum L.) [53], and cassava (Manihot esculenta) [54]. In our study, these AS events greatly enriched the transcriptional information of M. pasquieri. Studies have reported specific expression of AS events in different plant tissues. For example, the proportion of different AS events varied among maize and sorghum tissues [55]. Dynamic changes in AS events occur during different development stages and tissues of strawberry; for example, anthers at floral stages 7 and 8 had more AS genes compared with anthers from other anther stages [42]. These studies also provide direction for further research on AS events in M. pasquieri.

Analysis of lncRNAs Detected by PacBio Sequel
In addition to protein-coding RNAs, non-coding RNAs constitute a major component of the transcriptome [56]. Generally, lncRNAs are more than 200 nt in length, possess no apparent CDS or ORF, and lack protein coding capability [57]. Based on their genomic location, lncRNAs can be classified as antisense, intronic, and long intergenic noncoding RNA [58]. In recent years, studies have found that lncRNAs play a significant role in the physiology and development of plants, especially in some key biological processes [59]. However, only a small number of lncRNA functions have been determined. For example, studies have confirmed that lncRNAs participate in abiotic stress responses and act as regulatory factors [60]. In a transcriptional study on soybean (Glycine max) roots under continuous salt stress, about 77% of identified lncRNAs were activated or up-regulated by more than two-fold, and functional analysis of proteins with binding and catalytic activities were major targets of these newly identified lncRNAs, indicating the regulatory role of lncRNAs in soybean roots resistant to salt stress [61]. RNA Seq short-read sequencing, which is a powerful tool used to describe gene expression, has been widely used; however, it cannot provide full-length sequences for each RNA, which also increases the difficulty of detecting lncRNAs. Nevertheless, SMRT-seq technology can effectively capture full-length sequences of the genome and transcriptome [50]. In a study investigating the maize transcriptome, SMRT-seq identified 867 novel high-confidence lncRNAs with a mean length of 1.1 kb, which were much longer than the lncRNAs identified by RNA-Seq short-read sequencing [32]. LncRNAs have not yet been identified in M. pasquieri. In the present study, 149 common lncRNAs were predicted by four programs (Figure 7a), which will contribute to the functional study of lncRNAs in M. pasquieri. Although lncRNAs were identified by PacBio Sequel in this study, they could not be classified nor further studied due to a lack of genome data for M. pasquieri. A previous study also detected 223 and 205 lncRNAs in the leaf and root of Astragalus membranaceus, respectively [62], which may be helpful for the further study of lncRNA expression in different tissues of M. pasquieri.

Analysis of Nr Annotation and Transcription Factors
Among 25,339 transcripts, a total of 24,405 transcripts were annotated using four databases (Nr, Swissprot, KOG, and KEGG), including 24,358 transcripts annotated in the Nr database, accounting for 96.13% of the total annotated transcripts (Figure 3a and Figure S3). Comparison of M. pasquieri transcripts with the Nr data revealed that M. pasquieri shares homology with Vitis vinifera (3484, 14.30%), Theobroma cacao (1432, 5.88%), Sesamum indicum (1182, 4.85%), Juglans regia (1182, 4.85%), and Nelumbo nucifera (1063, 4.36%) (Figure 8b). Vitis vinifera possesses the highest homology, which may be explained by its relatively extensive database and better annotation compared with that of other species; however, its homology ratio is relatively low compared with other species. For example, in coffee (Coffea arabica) bean, a Nr-annotated tobacco species was much larger than that of Coffea canephora (1,746,308 versus 142,656 hits; maximum 50 hits per sequence) [63]. This is not unexpected, since there is no available genomic and transcriptomic information for M. pasquieri or a comprehensive genomic resource for Sapotaceae, only the genome of Argania spinosa has been reported [64], so as the plastome sequence of Pouteria campechiana (Kunth) Baehni [65], Manilkara zapota (L.) P.Royen [66], and chloroplast genome of Lucuma nervosa [67], Vitellaria paradoxa, and Sideroxylon wightianum [68]. Since studies on M. pasquieri remain in their infancy, and information available from other plants is relatively limited, further research is needed.
TFs are important regulatory components for seed germination and plant development [69], and many TF families, including WRKY, MYB, NAC, and bHLH, have been studied extensively in model plants and crops [70], but fewer studied in non-model plants [71]. For example, members of the MYB (HORVU0Hr1G018970, HORVU2Hr1G010450) and NAC (HORVU2Hr1G077320) family were found associated with regulating germination or root development in barley (Hordeum vulgare) [72]. SPATULA, a member of bHLH, mediates seed germination by affecting cell elongation in Arabidopsis [73]. Here, 1058 and 2048 TF genes were identified by PacBio and Illumina, and were classified into 51 and 57 TF families, respectively. Moreover, we found that they have the same abundant TF families, including ERF, WRKY, GRAS, NAC, bHLH, C3H, bZIP, C2H2, and MYB_related (Table 1). This indicated that these TF families were actively involved in the material synthesis and growth metabolism of M. pasquieri during all stages, which requires further studies.

Excavation of KEGG Annotation Pathways Gene Annotation Information in M. pasquieri
A large number of transcripts from M. pasquieri were associated with metabolic pathways (3795), biosynthesis of secondary metabolites (2189), biosynthesis of antibiotics (1229), microbial metabolism in diverse environments (1081), and carbon metabolism (910), indicating that the germination and growth of M. pasquieri requires varied metabolic supports. This also shows that there are multiple functional metabolites in M. pasquieri, many of which may be of potential value. Although some pathways were associated with fewer transcripts, they may still be worth noting.
Previous studies have indicated that most phytohormones, such as ABA, GA, auxin, ethylene, cytokinine, brassinosteroid and jasmonic acid are involved in seed germination and growth regulation [74]. In this study, 349 PacBio transcript isoforms have been involved in plant hormone signal transduction pathway (ko04075 ; Table S5). Studies have shown that GA promotes seed germination, whereas ABA is the most notorious GA antagonist for its inhibitory effect on seed germination [75,76]. It has been reported that GA mainly stimulates germination by promoting radicle elongation and penetration of the seed coat [71], and GA-GID1 complex induces the degradation of the plant growth inhibitor DELLA proteins to promote plant germination [77]. In the study, 22 transcripts were involved in GA pathway, and nine and ten transcripts were annotated as GID1 and DELLA, respectively. These results might suggest that specific members of the GID1 and DELLA genes of M. pasquieri are involved in the regulation of seed germination. Furthermore, PYR/PYL (17), PP2C (10), SnRK2 (19), and ABF (11), associated with ABA pathway, were identified in our study, which have been proven to be key components of ABA signaling in sheepgrass (Leymus chinensis) [78]. Auxin is present in the seedling radicle tip during and after germination, and cytokinine is activated during germination [79]. And, we found AUX1 (8), TIR1 (5), AUX/IAA (26), ARF (12), GH3 (2), and SAUR (6) were involved in auxin pathway; CRE1 (1), AHP (2), B-ARR (3), and A-ARR (3) in cytokinine pathway of M. pasquieri. Although, these results might indicate that these specific transcripts were associated with the regulation of seed germination and post-germination in M. pasquieri, we could not obtain more accurate information in specific time, and tissues.
Other pathways, like carbon fixation in photosynthetic organisms (ko00710), photosynthesis (ko00195) pathways were also important in M. pasquieri, especially in post-germination stages. Notably, during cultivation of M. pasquieri, the leaves changed from a distinct red to a dark red, and finally to green between the two to the nine-leaf stage, which may be associated with anthocyanin biosynthesis pathway (ko00942) and 20 annotated transcripts were involved in the study. Furthermore, flavonoid biosynthesis pathway (ko00941), isoflavonoid biosynthesis pathway (ko00943), flavone and flavonol biosynthesis pathway (ko00944), terpenoid backbone biosynthesis (ko00900), sesquiterpenoid and triterpenoid biosynthesis (ko00909), monoterpenoid biosynthesis (ko00902), and diterpenoid biosynthesis (ko00904) pathways were also found, providing support for development and utilization of M. pasquieri.
Interestingly, 1229 and 1081 PacBio transcript isoforms have been involved in biosynthesis of antibiotics (ko01130) and microbial metabolism in diverse environments (ko01120), respectively, while none of transcript unigenes involved in de novo assembly from Illumina. On the one hand, this might be the differences between PacBio and Illumina platforms. And the sample used in SMRT sequencing were mixed, however in NGS de novo assembly were individual samples, which may filter some lowquality reads during assembly, resulting in different transcripts being obtained. On the other hand, previous studies have shown that the annotation rate of PacBio isoforms were much higher than that of the de novo unigenes [42,80]. And our results showed the same conclusion (96.31% versus 53.07%), suggesting that longer transcripts may be easier annotated. This may explain that transcript unigenes, involved in biosynthesis of antibiotics and microbial metabolism in diverse environments, were not annotated, which need further studies.

Conclusions
In conclusion, this was the first comprehensive transcriptome analysis of M. pasquieri combining SMRT and NGS sequencing. We identified 25,339 transcript isoforms by PacBio, including 24,492 CDSs, 9440 SSRs, and 149 lncRNAs. A total of 1058 transcripts were identified as TFs, which were classified into 51 TF families. Additionally, 182 AS events were detected across five types (A3, A5, AF, RI, and SE), among which a majority was IR. Although de novo assembly from Illumina obtained more unigenes (124,405) owing to its greater sequencing depth, PacBio Sequel recovered more FL transcripts, with a longer mean length and N50, longer CDSs, and higher expression level. Using four databases, 24,405 transcripts (96.31%) were annotated by PacBio, while 66,026 unigenes were annotated by de novo assembly, accounting for only 53.07% of the total, indicating that PacBio can more accurately annotate transcripts. And, we found that 8.2% of the de novo transcript unigenes exhibited similarity to 61.4% of the PacBio transcript isoforms, and that 91.8% unigenes and 38.6% isoforms were unique to the Illumina and PacBio database, respectively. Functional annotation revealed a role for the auxin, GA, ABA, and cytokinine metabolic pathways, which are associated with seed germination and post-germination. In addition, multiple flavonoid and terpenoid metabolic pathways have been identified, which may be related to the potential value of M. pasquieri. Moreover, we can combine the metabolomics and proteomics in the further research, so as to better understand the mechanism of germination and growth of M. pasquieri. Our work provides a comprehensive transcriptome resource for future studies on functional gene mining and utilization, genetic resource classification and evolution, molecular marker development, and endangered mechanism of M. pasquieri.

Supplementary Materials:
The following are available online at http://www.mdpi.com/1999-4907/11/8/866/s1, Figure S1: Evaluation of PacBio and Illumina data by BUSCO software, Figure S2: Length distribution of unigenes obtained by de novo assembly, Figure S3: Functional annotations of unique transcripts in transcriptomes generated by Illumina and PacBio, Figure S4: The length distribution of Blast coding sequences (CDSs) in de novo assembly, Figure S5: The proportion of SSRs of different tandem repeat element types among the total SSR in de novo assembly, Figure S6: Annotated transcripts in 'plant hormone signal transduction' of KEGG pathways, Table S1: Statistics of the PacBio and de novo assembly data, Table S2: Statistics of 15 samples data filtering of RNA-Seq, Table S3: Base information statistics for 15 samples of RNA-Seq, Table S4: Statistics of simple sequence repeat (SSR) distribution in de novo assembly, Table S5