Gene Expression and Isoform Identification of PacBio Full-Length cDNA Sequences for Berberine Biosynthesis in Berberis koreana

Berberis koreana is a medicinal plant containing berberine, which is a bioactive compound of the benzylisoquinoline alkaloid (BIA) class. BIA is widely used in the food and drug industry for its health benefits. To investigate the berberine biosynthesis pathway, gene expression analysis was performed in leaves, flowers, and fruits at different stages of growth. This was followed by full-length cDNA sequencing analysis using the PacBio sequencer platform to determine the number of isoforms of those expressed genes. We identified 23,246 full-length unigenes, among which 8479 had more than one isoform. The number of isoforms ranged between two to thirty-one among all genes. Complete isoform analysis was carried out on the unigenes encoding BIA synthesis. Thirteen of the sixteen genes encoding enzymes for berberine synthesis were present in more than one copy. This demonstrates that gene duplication and translation into isoforms may contribute to the functional specificity of the duplicated genes and isoforms in plant alkaloid synthesis. Our study also demonstrated the streamlining of berberine biosynthesis via the absence of genes for enzymes of other BIAs, but the presence of all the genes for berberine biosynthesize in B. koreana. In addition to genes encoding enzymes for the berberine biosynthesis pathway, the genes encoding enzymes for other BIAs were not present in our dataset except for those encoding corytuberine synthase (CTS) and berbamunine synthase (BS). Therefore, this explains how B. koreana produces berberine by blocking the pathways leading to other BIAs, effectively only allowing the pathway to lead to berberine synthesis.


Introduction
The genus Berberis (Berberidaceae), also known as barberry, is a large genus containing approximately 500 species that are distributed throughout temperate and subtropical regions of the world [1,2]. The barberry plants are spiny, deciduous, evergreen shrub-or small tree-bearing berry fruits that are long and ripen red or dark blue [3]. Many Berberis species are known to have diverse phytochemicals, such as various alkaloids, terpenoids, flavonoids, sterols, anthocyanins, lignans, lipids, and carotenoids, so For our research, we conducted experiments to understand gene expression and isoform identification in berberine biosynthesis using different tissue types of B. koreana. Differential gene expression (DEG) analysis using Illumina short-read sequencing data showed differences in the gene expression encoding berberine biosynthesis among different tissues (i.e., leaf, flower, and fruit). Furthermore, we conducted full-length cDNA sequencing using the SMRT sequencing technique by PacBio to investigate comprehensive gene expression and isoform copy number in berberine biosynthesis. This article also demonstrated the end-to-end process from sequencing to contig assembly to functional annotation (from wet-lab technique to computational analysis).

Identification and Expression Analysis of Berberine Biosynthesis Pathway Genes
The BIA biosynthetic pathways start with two L-tyrosine molecules, which are converted to dopamine and 4-hydrophenylacetaldehyde ( Figure 1). These two molecules are combined by (S)-norcoclaurine synthase (NCS) to become (S)-norcoclaurine, an immediate precursor to all BIAs. (S)-Norcoclaurine is converted to (S)-reticuline via a series of enzymatic reactions [5]. (S)-Reticuline is then transformed into various BIAs, including berberine, noscapine, sanguinarine, morphine, etc. We identified all the enzymes crucial for the berberine biosynthesis pathway from L-tyrosine to berberine in the transcriptomes in both Illumina and PacBio Iso-Seq libraries. The isolated berberine biosynthesis enzymes were L-tyrosine aminotransferase  (Table S1). Gene families and isoforms of these enzymes are shown in Table 1.
As expected, the genes for enzymes required in the pathways synthesizing other BIAs were not identified in our dataset. For instance, transcripts of reticuline epimerase (REPI) for codeine and morphine synthesis, cheilanthifoline synthase/CYP719A25 for sanguinarine synthesis, norreticuline 7-O-methyltransferase (N7OMT) for papaverine synthesis, and tetrahydroprotoberberine N-methyltransferase (TNMT) for noscapine synthesis were absent in our study ( Figure 1). Although berbamunine and corytuberine were not precursors for berberine synthesis, and unigenes encoding the enzymes, berbamunine synthase (CYP80A1) and corytuberine synthase (CYP80G2) were present in our dataset.

Berberis Koreana Transcriptome Sequencing by PacBio
A pooled RNA sample of three leaf tissues was sequenced using the PacBio Sequel platform in two SMRT cells with size ranges of <4 kb and >4 kb. Table S1 shows a summary of the results. Figure 2 shows the steps in the computational pipeline used to obtain unique full-length transcripts in B. koreana. In the polymerase read, the total number of bases was over 63 Gbp (31 Gbp in the <4 kb library and 32.5 Gbp in the >4 kb library), which yielded 508,050 reads and 570,661 reads in the <4 kb and >4 kb libraries, respectively. In circular consensus sequence (CCS) analysis, we obtained 900,876 CCS reads, which consisted of 455,028 reads from 890 Mbp in the <4 kb library and 445,848 reads from 1.97 Gbp in >4 kb library, respectively. The mean read lengths were 1957 and 4434 bp in the respective libraries. With the aid of standard Iso-Seq classification and clustering protocol (see material and methods) on the obtained CCSs, we produced 43,635 and 32,996 highquality (HQ) polished transcripts and 261 and 430 low-quality polished transcripts in <4 kb and >4 kb groups, respectively (Table S1).
With the COding GENome reconstruction Tool (Cogent v7.0.0, https://github.com/ Magdoll/Cogent (accessed on 3 May 2021)), the HQ coding sequences (76,631) were further analyzed to generate 19,902 reads from reconstructed coding contigs of 60.49 Mbp. The number of reads in unassigned sequence was 3344. In UniTransModels, the fake genome comprised reconstructed coding contigs and unassigned sequences. Thus, the number of reads in the fake genome was 23,246, and they varied in length from 100 to 13,544 bp with a mean of 3059 bp. Figure 3 shows the size distribution of the reads, with the most frequent length falling between 1000 and 1999 bp, followed by a 4000-4999 bp range.  Of these 23,246 reads, 14,767 unigenes had no isoform, making up 64.21% of the total transcripts. Of the remaining 8479 unigenes, 4812 (20.92%) produced 2 isoforms, followed by 1680 (7.30%) producing 3 isoforms and 728 (3.17%) producing 4 isoforms ( Table 2). For isoform analysis, we analyzed in detail the unigenes involved in berberine biosynthesis. Figure 4 shows (S)-norcoclaurine synthase (NCS) transcripts exhibiting ten isoforms. NCS catalyzes the formation of norcoclaurine from dopamine and 4-hydroxyphenylacetaldehyde in the berberine synthesis pathway [5].

Functional Annotation
The B. koreana transcripts were functionally annotated and categorized by mapping against NCBI databases. Of the 23,246 unigenes, 20,029 (86.16%) and 16,179 (69.9%) were matched with the Nt and Nr databases, respectively. Of those 20,029 unigenes matched to the Nt database, the most unigenes were matched with those of Nelumbo nucifera (3006), followed by Papaver somniferum (1554) and Camellia sinensis (873) ( Figure S1). A similar trend was followed in the Nr database, where the most were annotated to N. nucifera (15,082), followed by P. somniferum (4412); in contrast, the third and fourth most transcripts were matched to Vitis vinifera and Vitis riparia ( Figure S1).
In functional classification, GO terms were assigned to each of the UniTransModels via the BLAST2GO (bioinformatics platform) program based on annotation of the Nr database. Overall, 16,756 (53.18%) unigenes were assigned into the three major classification categories: biological process, molecular function, and cellular component ( Figure 5). In the biological process category, there were five major representative subgroups with over 10,000 transcripts: cellular process (GO:00099871), metabolic process (GO:0008152), response to stimulus (GO:0050896), biological regulation (GO:0065007), and regulation of biological process (GO:0050789). In the molecular function category, two subgroups, binding (GO:0005488) and catalytic activity (GO:0003824), were predominant. In the cellular component category, only two subgroups were present: cellular anatomical entity (GO:0110165) and protein-containing complex (GO:0032991). For exploration of biological functions and interactions, B. koreana UniTransModels were queried against the Kyoto Encyclopedia of Genes and Genomes database (KEGG). In total, 12,362 transcripts were found in the 151 KEGG pathways, which involved five functional categories ( Figure S2, Table 3). Among these pathways, the "metabolism" pathway included the most transcripts, representing 93% of the data sets. In metabolism, "metabolism of cofactors and vitamins", "carbohydrate metabolism", "nucleotide metabolism", and "amino acid metabolism" were the most abundant unigenes ( Figure S3; Table 3).

Isoforms in Berberine Biosynthesis
Of the 23,246 unigenes identified, 14,767 (64.21%) did not have isoforms, and the rest produced isoforms numbering from 2 to 31 (Table 2), with an average of 15 isoforms per gene. O-methyltransferase and oxidases are major enzyme families in BIA synthesis [5]. In our dataset, the number of unigenes was 46 and 139 for O-methyltransferases and CYP oxidases, respectively (Table S2). Among the 15 genes involved in berberine biosynthesis, the number of paralogues per gene ranged from 1 to 22, such that BBE, STOX, and SOMT had no paralogue, whereas 3OHase had 22 paralogues ( Table 1). The number of isoforms ranged from 1 to 10, with an average of 3.6 per gene. Figure 4 shows the structures of isoforms of PB10989, which is one of the 11 paralogues of the NCS gene family. The number of exons in PB10989 ranged from one (PB10989.4) to four (PB10989.3, PB10989.7, PB10989.8, PB10989.9).
Phylogenetic analysis of the NSC families in B. koreana and NSC enzymes from other species producing BIAs in Ranunculales and N. nucifera in Proteales was performed ( Figure S4). For phylogenetic analysis, the longest isoforms were used if more than one isoform was present in B. koreana. The eleven isoforms clustered into two major clusters: one was exclusive to B. koreana NCSs, and the other cluster included NCSs of other species. The exclusive B. koreana cluster included the isoforms PB21974.1, PB7426.1, PB15604.1, PB10989.1, and PB15603.2. The second major cluster was divided into several small groups, with PB.9306.1 being closely placed with N. nucifera NCS. PB8681.1, PB8683.1, and PB8682.1 were closest to Thalictrum thalictroides and Sinopodophyllum hexandrum NCSs. The isoform PB11959.1 tied with others as a solo isoform at a single node.

Phylogenetic Analyses of the Methyltransferases and Oxidase/Reductases among the Berberine Synthesis Enzymes
We identified 46 O-methyltransferases and 139 CYP450 oxidase-encoding unigenes in our dataset (Table S3). Among these, the enzymes involved in the berberine biosynthesis pathway were analyzed for their phylogenetic relationships with homologues in other species in Ranunculales and N. nucifera ( Figure S5). In the B. koreana transcriptome dataset, four O-methyltransferases were recognized in the berberine biosynthetic pathway. The numbers of unigenes encoding O-methyltransferases were 2, 2, 1, and 4 in 6OMT, 4OMT, SOMT, and CNMT, respectively. Each of the O-methyltransferases formed a distinct clade with homologues from other species with high bootstrap values. The CNMT clade was out-formed from the large cluster of 6OMT, 4OMT, and SOMT with low bootstrap values. In oxidases/reductases, the overall tree showed two large clusters: one with CYP oxidases and another with BBE and STOX ( Figure 6). In the cluster of CYPs, two subclusters were formed, one with CYP719A21/CA and another with CYP80A, CYP80G, and CYP80B/NMCH. In the CYP80 subclade, the CYP80B/NMCH members separated from those of CYPA and CUP80G. The numbers of unigenes for CYP719A21/CAS, CYP80A, CYP80G, CYP80B/NMCH, BBE, and STOX were 3, 2, 4, 2, 1, and 1, respectively, and each of these enzymes also formed a distinct clade with homologues from other species except for two CYP80A enzymes. The large cluster of BBE and STOX was robust with a bootstrap value of 100.

Discussion
De novo Iso-seq transcriptome sequencing analysis was carried out with the publicly available bioinformatics tool Cogent in the medicinal plant B. koreana. Cogent is specifically designed to use in cases where there is no reference genome available [22,23]. With the lack of a reference genome in B. koreana, we built a bioinformatics pipeline using Cogent for de novo genome assembly and annotation of transcriptomes using high-quality Iso-Seq data ( Figure 2). Our analysis proved that this pipeline could be applied to characterize full-length transcriptomes in any other species lacking a reference genome.
PacBio Iso-Seq allows reading long transcripts with relatively low error rates because the reads are derived from the results of multiple sequencing of circular cDNA in SMRT cells [18,21]. In our analysis, the average length of the transcripts was 3059 bp, and the longest read was 13,544 bp. Long reads have several advantages, including better chances for Basic Local Alignment Search Tool (BLAST) matching with annotated databases, identification of isoforms, and identification of gene families. In the current study, 86.16% (20,029) and 69.9% (16,179) of the 23,246 identified unigenes matched with the Nt and Nr databases, which are lower than the percentages of matches for ginseng (95.8%) [20]. This suggests that there are still many novel or uncharacterized transcripts in B. koreana. Similar results have been reported in another medicinal herb, Z. planispinum [21]. The higher matching of the unigenes of B. koreana with lotus (N. nucifera) than with opium poppy (P. somniferum) unigenes was odd because the genera Berberis and Papaver belong to the order Ranunculales, but the genus Nelumbo belongs to the order Proteales. The species in Ranunculales produce a variety of alkaloids [24], and opium poppy produces morphine, codeine, and noscapine, but not berberine [4,5]. SMRT analysis of C. deltoidea, a plant of Papaveraceae in Ranunculales, also revealed that 29.01% of the full-length transcripts showed significant matching with those of lotus [7]. Lotus produces over 200 natural compounds, including alkaloids, glycosides, flavonoids, and terpenoids [25], and genes involved in the BIA biosynthetic pathway were thoroughly investigated for their transcriptional regulation by transcriptome analysis. Our GO analysis results corroborated results for the alkaloid-producing medicinal plants Z. planispinum [21] and C. deltoidea [14], specifically that the majority of the transcripts were in biological processes with cellular processes and metabolic processes being dominant. The KEGG results of B. koreana revealed that a high number of transcripts were assigned to the metabolism pathways of vitamins and cofactors, carbohydrates, nucleotides, and amino acids, which is meaningful because various phytochemicals are derived from these molecules. For example, BIAs are derived from an amino acid tyrosine, which is congruent with the results of a transcriptome study of Z. planispinum [21].
Gene duplications are common in eukaryotic species, and genes for secondary metabolism are often present in families that are derived from gene duplication [26]. The duplicated genes were diversified into new pathways in synthesizing secondary metabolites, including BIAs [7,27,28]. In lotus, CYP80 was duplicated into two copies, NnCYP80A and NnCYP80G, that play important roles in two groups of alkaloids, aporphine-type BIAs and bis-BIAs [7]. He et al. in 2018 [28] reported similar results in Coptis species, where BIA synthesis genes were present mostly in families, but the copy numbers varied for some genes between Coptis teeta and Coptis chinensis. In the current study, thirteen of the sixteen genes encoding berberine synthesis enzymes were present in more than one copy in B. koreana. The sizes of family genes were comparable with those of genes in the Coptis species [28]. The 3OHase gene had 22 copies in B. koreana, whereas it had six to seven copies in Coptis. A notable difference was BBE, which was present in only a single copy in B. koreana, but eight copies were present in Coptis species. BBEs form a subgroup of superfamilies of FAD-linked oxidases that catalyze the conversion of (S)-reticuline to (S)-scoulerine, which is a common precursor for berberine, jatrorrhizine, coptisine, and other BIAs [29]. The presence of diverse BIAs in Coptis species might have resulted in the expansion of the BBE genes. Gene family size variation shapes natural variation for adaptation in various plant species [30]. For instance, the gene encoding the DBOX enzyme, the last catalytic enzyme in the sanguinarine pathway, was expanded to 35 copies in the medicinal plant Macleaya cordata, but the number of GBOX gene copies was only one in grapevine and ten in Arabidopsis [31]. We also demonstrated that the duplicated copies in each gene diversified and adapted to be differentially expressed between tissues or in growth stages, which was evidenced by other genes involved in plant secondary metabolism [26]. In lotus, most of the BIA synthesis genes showed higher expression in leaf tissues of a high-BIA cultivar than in a low-BIA cultivar, also revealing stage-specific expression [7]. Oxidase/reductases and methyltransferases are major catalytic enzymes in BIA biosynthesis. The cytochrome P450s (CYPs) are an enzyme superfamily present in all kingdoms of life [32], and we identified 124 copies of CYP450 oxidases in the B. koreana transcripts. Three CYP subfamilies, CYP80, CYP82, and CYP719, were identified in the BIA biosynthetic pathway [5]. CYP80 and CYP719 subfamilies were present, but CYP82 was absent in B. koreana transcripts. CYP82 is involved in the synthesis pathway of noscapine and sanguinarine [5], but B. koreana does not synthesize these BIAs. Three members of the CYP80 subfamily were identified in the BIA biosynthetic pathway: CYP80G2, CYP80B3, and CYP80A1. We found seven unigenes encoding CYP80, and they were further classified as CYP80A (PB5792.2), CYP80B (PB21419.1 and PB5745.1), and CYP80G (PB6475.1, PB6474.1, PB10724.1, PB10725.1). CYP80B3 is synonymous with (S)-N-methylcoclaurine 3 -hydroxylase/N-methylcoclaurine 3 -monooxydase (NMCH), which catalyzes the C-3 hydroxylation of (S)-N-methylcoclaurine to produce (S)-3 -hydroxy-N-methylcoclurine in the berberine synthesis pathway [27]. CYP80A is a berbamunine synthase that converts (S)-N-methylcoclaurine to berbamunine [33]. CYP80G is an (S)-corytuberine synthase that converts (S)-reticuline to corytuberine [34]. In phylogenetic analysis (Figure 6), the CYP80A and CYP80G members clustered with the CAS/CYP719 clade, which catalyzes the conversion of (S)-tetrahydrocolumbamine to (S)-canadine [35]. There were three unigenes (PB4732.1, PB4733.1, PB6044.1) encoding CAS/CYP719A2 in the transcriptomes of B. koreana.
One of the key findings of the genomics studies is that the number of protein coding genes is much less than expected. In addition, the number of genes does not vary greatly between developmentally complex organisms and simple organisms, which was proposed as the G-value paradox [40]. Alternative splicing can partially account for the low number of protein-coding genes and the G-value paradox. SMRT sequencing can analyze the isoforms from alternate splicing in plant transcriptomes (An et al. 2018). Approximately 35.8% of the unigenes had isoforms in the transcript dataset in B. koreana, which is higher than the number from Iso-Seq analysis (17.6%) of Z. planispinum [21]. The average 1.7 isoforms per gene is lower than the average 3.93 isoforms per gene in cotton [41] and 6.56 isoforms per gene in maize [18]. TYDC, 3OHase, NCS, and COR exhibited the highest number of paralogues and isoforms among the genes involved in berberine biosynthesis ( Table 1, Table S1). Plants produce a high number of non-functional isoforms to avoid the hostile effects and metabolic cost that occur under the effects of stress when functional proteins provide resistance to plants [42]. Producing multiple alternative splicing sites and producing complete protein saves time and energy in transcriptional activation and the accumulation of necessary mRNAs [43].
The biosynthetic pathway of BIAs has been well established in the genera Papaver [5], Coptis [14,44], and Nelumbo [7]. However, no report is available for the berberine pathway in B. koreana, and our report is the first transcriptome study in the genus Berberis. In the current study, the genes encoding enzymes for the pathway of berberine biosynthesis were identified, but the genes for enzymes in the pathways for other BIAs were not present in our dataset except for corytuberine synthase (CTS) and berbamunine synthase (BS) (Figure 1). (S)-Norcoclaurine, an immediate precursor for all BIAs, is converted to berberine by eight sequential biochemical reactions in which seven intermediate protoberberine molecules are involved. Of the seven protoberberine molecules, some are precursors for the branching pathway for other BIAs, and these branching pathways are blocked by the absence of the genes encoding appropriate enzymes. For instance, (S)-coclaurine is a precursor molecule for either papaverine synthesis or (S)-reticuline [45]. However, the pathway leading to papaverine is blocked by the absence of the gene encoding N7OMT in B. koreana. (S)-reticuline can be converted into either (S)-scoulerine by BBE [46] or (R)-reticuline by reticuline epimerase (REPI) [5]. (R)-reticuline is a precursor for the synthesis of codeine and morphine [5]. In our dataset, BBE was present, but REPI was absent, so codeine and morphine were not synthesized in B. koreana. (S)-canadine can be converted to either berberine by STOX or (S)-N-methylcanadine by TNMT to lead to noscapine synthesis [47]. TNMT was absent in the transcriptome dataset so that this pathway was blocked, and the synthesis of berberine could proceed by STOX in B. koreana. Thus, our results clearly demonstrated how B. koreana produces berberine by blocking the pathways leading to other BIAs, but effectively allowing the pathway leading to berberine synthesis. In conclusion, the first research about BIA biosynthesis pathway at the transcriptome level in B. koreana was presented. It demonstrated distinctive characteristics in difference between B. koreana and others such as lotus. We believe this study can be a fundamental resource for further research to understand berberine as a bioactive compound for food and drug applications.

Plant Material and Storage
Five tissue samples (young and mature leaves, flower, young, and mature fruits) of B. koreana were obtained from a stand planted at the experimental garden of Hallym University, which was originally collected at Kangwon province of Korea. Collected tissues were immediately frozen with liquid nitrogen and stored at −80 • C until use.

Illumina RNA-Seq Library Construction and Sequencing
The total RNA was extracted using Hybrid-R RNA extraction kit (Geneall Biotechnology, Seoul, Korea) according to the manufacturer's protocol. For Illumina library construction, extracted RNA from five different tissues with high RNA integration number (RIN) values (>8) were used for cDNA synthesis. The cDNA was sheared with an average of 500 bp fragment sizes. The TruSeq Library Preparation Kit (Illumina Inc., San Diego, CA, USA) was used to construct the DNA library according to the manufacturer's protocol. The DNA libraries were sequenced with 150-bp paired-end sequencing using an Illumina Hiseq2500. The quality of the constructed libraries was confirmed by a LabChip GX system (PerkinElmer, Waltham, MA, USA).

Differential Gene Expression (DEG) Analysis
Illumina reads were aligned using Bowtie 2 v2.4.2 [48]. Using RSEM (v1.1.12) [49], the read count values were directly obtained and converted to fragments per kilobase of transcript per million mapped reads (FPKM) values. Then, the DEGs between different tissue samples (leaf vs. flower, young leaf vs. mature leaf, and young fruit vs. mature fruit) were detected with the standardization method TMM using edge R [50]. The significant DEGs were screened at false discovery rates (FDRs) <0.05 and absolute log2fold change values >0.01.

Full-Length cDNA Sequencing
The same amount of total RNA from five tissues (flower, young leaf, mature leaf, young fruit, and mature fruit) were pooled and subjected to RNA quality check (Agilent Technologies, Santa Clara, CA, USA), followed by cDNA synthesis for full length cDNA sequencing analysis. The cDNA size selection was performed through a BluePippin (Sage Science, Beverly, MA, USA) to build a cDNA library of size <4 kb and >4 kb cDNA. Iso-Seq library preparation and sequencing were done using PacBio full-length cDNA library and sequencing kit according to the manufacture's protocol (Pacific Biosciences of California, Inc., Menlo Park, CA, USA) in sequencing service provider, Theragen Bio. (Theragen Bio Inc., Sungnam, Korea).

Iso-Seq Data Processing with a Standard Bioinformatics Pipeline
Raw sequencing data processing was performed by a standard Iso-Seq protocol in SMRTlink4.0 software. Polymerase reads <50 bp were removed, and the obtained subread BAM files were processed into error-corrected circular consensus (CCS) using the following parameters: full passes ≥0 and predicted consensus accuracy >0.75. By identifying the 5 -and 3 -adapters and the poly (A) tail, full-length and non-full-length reads were classified as full-length and non-full-length. CCSs with all 5 -and 3 -reads were referred to as non-full-length reads, whereas those with all three elements that did not contain any additional copies of the adapter sequence within the DNA fragment were referred to as full-length non-concatemer (FLNC) reads. FLNC reads were clustered into consensus sequences using the Iterative Clustering for Error Correction (ICE) algorithm (https://www.pacb.com/products-and-services/analytical-software (accessed on 3 May 2021)). These reads were combined with non-full-length transcripts and were further polished in clusters using Quiver [51] using the parameters hq_quiver_min_accuracy 0.99, bin_by_primer false, 300 bin_size_kb 1, qv_trim_5p 100, and qv_trim_3p 30 to obtain full-length high-quality (HQ; above 99% accuracy) and low-quality (LQ) polished consensus sequences.

Full-Length Unique Transcript Model Reconstruction
Error-corrected HQ and LQ full-length polished consensus transcripts were merged to remove redundancy using the CD-HITv4.6 package with the parameters -c 0.99-G 0-aL 0.00-aS 0.99-AS 30-M 0-d 0-p 1 [52]. The non-redundant transcripts were processed with the Coding GENome reconstruction Tool (Cogent v7.0.0, https://github.com/Magdoll/Cogent (accessed on 3 May 2021)). In general, Cogent first creates the k-mer profile of nonredundant transcripts, calculates pairwise distances, and then clusters transcripts into families based on their k-mer similarity. Each transcript family was further reconstructed into one or several unique transcript model(s) (referred to as UniTransModels) using a De Bruijn graph method.

Isoform Identification
Error-corrected non-redundant transcripts (transcripts before cogent reconstruction) were mapped to UniTransModels using Minimap2 v2.6 [53]. Splicing junctions for transcripts mapped to the same UniTransModels were examined, and transcripts with the same splicing junctions were collapsed using Cupcake ToFU v13.0.0 [21]. Collapsed transcripts with different splicing junctions were identified as transcription isoforms of UniTransModels.

Functional Annotation and Phylogenetic Analysis
To obtain annotation information of unigenes, transcripts were mapped into various databases. We compared non-redundant protein sequences (Nr) and NCBI nonredundant nucleotide sequences (Nt) against the NCBI database by BLAST v2.10.1 with an E-value cut-off of 1e-5. The Kyoto Encyclopedia of Genes and Genomes (KEGG) and Genome Ontology (GO) analyses were carried out by BLAST2GO v5.2.5 with an E-value cut-off of 1e-5. Phylogenetic analysis was performed using the protein sequences of genes in-volved in berberine biosynthesis. The genes were aligned with the same genes of closely related families, and a maximum likelihood phylogenetic tree was built using MEGA X (v.10.2.4) [54].
Supplementary Materials: The following are available online at https://www.mdpi.com/article/10 .3390/plants10071314/s1, Table S1. Berberine pathway biosynthetic enzyme genes from B. koreana unigenes; Table S2. Pacbio summary of RNA-Seq data from two RNA libraries of B. koreana; Table S3. List of CYPs and O-methyl transferases listed in B. koreana; Figure S1. Species distribution of top BLAST search hits for the assembled transcripts in Nt and Nr databases; Figure S2. Enrichment analysis of the KEGG pathways assigned to the Berberis koreana unigenes. Distribution of the B. koreana unigenes in KEGG biological categories; Figure  Data Availability Statement: All data were submitted to the National Centre for Biotechnology Information (NCBI) under SRA number PRJNA705056.

Conflicts of Interest:
The authors have no conflicts of interest to declare.