A High-Quality Reference Genome Assembly of Prinsepia uniflora (Rosaceae)

This study introduces a meticulously constructed genome assembly at the chromosome level for the Rosaceae family species Prinsepia uniflora, a traditional Chinese medicinal herb. The final assembly encompasses 1272.71 megabases (Mb) distributed across 16 pseudochromosomes, boasting contig and super-scaffold N50 values of 2.77 and 79.32 Mb, respectively. Annotated within this genome is a substantial 875.99 Mb of repetitive sequences, with transposable elements occupying 777.28 Mb, constituting 61.07% of the entire genome. Our predictive efforts identified 49,261 protein-coding genes within the repeat-masked assembly, with 45,256 (91.87%) having functional annotations, 5127 (10.41%) demonstrating tandem duplication, and 2373 (4.82%) classified as transcription factor genes. Additionally, our investigation unveiled 3080 non-coding RNAs spanning 0.51 Mb of the genome sequences. According to our evolutionary study, P. uniflora underwent recent whole-genome duplication following its separation from Prunus salicina. The presented reference-level genome assembly and annotation for P. uniflora will significantly facilitate the in-depth exploration of genomic information pertaining to this species, offering substantial utility in comparative genomics and evolutionary analyses involving Rosaceae species.

The rapid advancement of genome sequencing technologies has substantially reduced the challenges associated with assembling high-quality genomes, offering promising prospects for precise plant breeding [5]. To date, genome sequencing efforts have covered a substantial number of Rosaceae species, with most of the genome sequences accessible through the Genome Database for Rosaceae (GDR) [6]. As of September 2023, the GDR database houses a total of 128 genome assemblies, encompassing 11 genera and 59 Rosaceae species. These genomic datasets, complemented by other omics data, have greatly accelerated the breeding programs for Rosaceae plants.
P. uniflora, a deciduous shrub primarily found in Northwest China at altitudes ranging from 900 to 1100 m, has received notably less attention than other prominent members of the Rosaceae family. However, this plant possesses substantial medicinal value. The kernels of P. uniflora, referred to as 'ruiren' in China, have long been used to treat eye conditions in traditional Chinese medicine [7]. Unfortunately, limited omics data have been published for P. uniflora thus far, impeding comprehensive genomic analyses of the biosynthetic pathways of its medicinal components. Recent phylogenetic investigations within the Rosaceae family have revealed that the genus Prinsepia is situated within the Exochordeae clade, closely related to Kerrieae and Sorbarieae [8,9]. The development of a high-quality genome assembly and annotation for P. uniflora will also prove invaluable for phylogenetic analyses, investigations into polyploidization events, and the study of karyotype evolution within Rosaceae.
In this study, we employed PacBio high-fidelity (HiFi) sequencing technology to assemble a high-quality genome of P. uniflora. Subsequently, we anchored this assembly onto chromosomes using high-throughput chromosome conformation capture (Hi-C) data. Leveraging this reference-level P. uniflora assembly, we conducted comprehensive annotation of repetitive sequences, protein-coding genes, and non-coding RNAs (ncRNAs) utilizing diverse bioinformatics methodologies. This high-quality P. uniflora genome assembly and its associated annotation will provide the scientific community with a valuable genomic asset, facilitating the in-depth exploration of P. uniflora's genomic information and supporting genetic and genomic inquiries within the Rosaceae family.

Sample Collection
Leaf specimens for genome sequencing were obtained from a 2-year-old individual P. uniflora plant located in the Yinchuan Botanical Garden, Ningxia Province, Northwestern China. Fresh samples of leaves, stems, flowers, and roots were harvested from the same plant in its flowering period. All plant materials were immediately flash-frozen in liquid nitrogen before DNA or RNA extraction.

Genome Survey
We extracted total genomic DNA using the cetyl trimethylammonium bromide (CTAB) method. The NEBNext Ultra II DNA Library Prep Kit (New England Biolabs, MA, USA) was used to create paired-end Illumina ReSeq libraries with an average insert size of 400 bp. These libraries were then sequenced on an Illumina NovaSeq 6000 platform (Illumina Inc., San Diego, CA, USA). Following this, we estimated the haploid genome size and heterozygosity rate of P. uniflora by analyzing the k-mer (k = 19) distribution frequency of Illumina sequencing reads, employing Jellyfish v2.2.9 [10].
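The principle behind k-mer-based genome size estimation can be illustrated with a minimal sketch. It is not the authors' pipeline: the function name, the `min_depth` cutoff used to discard low-depth (likely erroneous) k-mers, and the toy histogram are all assumptions made for illustration. The estimate is simply the total number of usable k-mers divided by the depth of the main peak.

```python
def estimate_genome_size(histogram, min_depth=5):
    """Estimate haploid genome size from a k-mer depth histogram.

    histogram: iterable of (depth, count) pairs, i.e. how many distinct
    k-mers were observed `depth` times in the reads.
    min_depth: hypothetical cutoff; k-mers below it are treated as
    sequencing errors and ignored.
    Returns (estimated_size, peak_depth).
    """
    usable = [(d, c) for d, c in histogram if d >= min_depth]
    if not usable:
        raise ValueError("no k-mers at or above min_depth")
    # The main peak approximates the sequencing depth of unique k-mers.
    peak_depth = max(usable, key=lambda dc: dc[1])[0]
    # Total k-mer occurrences divided by depth approximates genome size.
    total_kmers = sum(d * c for d, c in usable)
    return total_kmers / peak_depth, peak_depth


# Toy histogram (made-up numbers, not data from this study).
toy = [(1, 1000), (2, 300), (9, 10), (10, 50), (11, 10)]
size, peak = estimate_genome_size(toy)
```

This simple form assumes that the tallest peak above the cutoff is the homozygous peak, which is reasonable only for genomes with low heterozygosity.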

PacBio HiFi Sequencing and Assembly
A modified CTAB approach was used to extract high-molecular-weight DNA. We generated HiFi reads using the PacBio Sequel II platform (Pacific Biosciences, Menlo Park, CA, USA) in circular consensus sequencing (CCS) mode, adhering to the PacBio 15 kb protocol to simplify PacBio HiFi sequencing. The resulting HiFi reads were preprocessed with the CCS analysis workflow in SMRT Link v8.0 (PacBio) and subsequently assembled into a contig-level assembly using hifiasm v0.14 [11]. Possible duplicate haplotypes were identified and removed from the original assembly using Purge Haplotigs v1.1.1 [12]. Additionally, we mapped Illumina reads back to the polished assembly using BWA v0.7.17 [13]. Pseudo-contigs with exceptionally low coverage depth (<5×) and high GC content (>50%) were excluded.
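The final filtering step above can be sketched as follows. This is an illustrative reading, not the authors' script: it assumes that a contig is excluded only when both conditions hold (mean depth below 5× and GC content above 50%), and the function names and inputs are hypothetical.

```python
def gc_fraction(seq):
    """Fraction of G/C among unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def keep_contig(seq, mean_depth, min_depth=5.0, max_gc=0.50):
    """Return False for contigs that look like pseudo-contigs.

    Assumption: a contig is dropped only if it has BOTH low Illumina
    coverage (< min_depth) and high GC content (> max_gc).
    """
    low_coverage = mean_depth < min_depth
    high_gc = gc_fraction(seq) > max_gc
    return not (low_coverage and high_gc)


# Toy examples (made-up sequences and depths).
flags = [
    keep_contig("GGCCGGCCAT", 2.0),   # low depth, GC 0.8: dropped
    keep_contig("GGCCGGCCAT", 30.0),  # normal depth: kept
    keep_contig("ATATATGC", 2.0),     # low depth but GC 0.25: kept
]
```

In practice the mean depth per contig would come from the read alignments; that step is omitted here.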

Hi-C Sequencing and Scaffolding Analysis
Hi-C libraries were created via chromatin extraction and digestion, DNA ligation, and purification, all in accordance with a previously established protocol [14]. These libraries were then sequenced using an Illumina NovaSeq 6000 platform. Using Juicer v1.8.8 [15], Hi-C paired-end reads were aligned back to the contigs, and only uniquely mapped Hi-C reads were retained. Finally, the contigs were anchored into pseudochromosomes using the 3D-DNA program [16].

RNA Sequencing
For RNA sequencing (RNA-seq), we isolated total RNA from fresh leaf, stem, flower, and root tissues of P. uniflora using TRIzol reagent. After eliminating residual DNA, RNA-seq libraries were generated using the NEBNext Ultra II RNA Library Prep Kit and sequenced using an Illumina NovaSeq 6000 platform. The resulting RNA-seq reads were filtered using Trimmomatic v0.36 [17] prior to transcriptome-based gene prediction.

Genome Assessment
We assessed the completeness of both the genome assembly and the protein-coding gene set by calculating the benchmarking universal single-copy orthologs (BUSCO) completeness score. This analysis employed BUSCO v3.0.2 [18] and utilized the Embryophyta odb10 dataset. Furthermore, we computed long terminal repeat (LTR) assembly index (LAI) scores using LTR_retriever v2.8 [19] with a sliding window size of 3 Mb.

Repeat Annotation
We generated a de novo repeat library based on the genome assembly using RepeatModeler v2.0.1 [20]. This library was then merged with the green plant repeat library from the Repbase database version 22.11 [21]. Finally, we conducted homology-based detection of repetitive elements using RepeatMasker v4.1.0 [22].

Gene Prediction and Functional Annotation
Based on the repeat-masked assembly, we used a combination of homology-based searches, de novo prediction, and transcript alignment to annotate protein-coding genes. First, using Program to Assemble Spliced Alignment (PASA) v2.3.3 [26], transcripts were assembled using RNA-seq data, and the alignment results were used to estimate gene structures.
Whole-genome duplication (WGD) analysis was carried out based on all-against-all pairwise comparisons of protein sequences using DIAMOND v0.9.22 [52]. MCScanX v1.1 was used to identify syntenic gene pairs both within and between genomes [53]. The synonymous substitution rate (Ks) of each gene pair was determined using the MCScanX script "add_ka_and_ks_to_collinearity.pl". By examining the Ks distributions of orthologous and paralogous gene pairs within and between species, the occurrence of WGD events was investigated.
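How a Ks distribution is read for WGD signals can be shown with a minimal sketch. It is an illustration only: the histogram-based peak finder, the bin width, and all function names are assumptions and do not reproduce the authors' analysis. The idea is to locate the modal Ks bin for paralogs and for orthologs; a paralog peak at a lower Ks than the ortholog peak suggests that the duplication is younger than the species split.

```python
from collections import Counter


def ks_peak(ks_values, bin_width=0.05, max_ks=3.0):
    """Return the center of the most populated Ks bin.

    ks_values: synonymous substitution rates of syntenic gene pairs.
    bin_width, max_ks: hypothetical settings; values above max_ks are
    ignored because very high Ks estimates are typically saturated.
    """
    bins = Counter(
        int(ks / bin_width) for ks in ks_values if 0.0 <= ks <= max_ks
    )
    if not bins:
        raise ValueError("no Ks values in range")
    # Ties are broken in favor of the lower (younger) bin.
    best_bin = min(bins, key=lambda b: (-bins[b], b))
    return (best_bin + 0.5) * bin_width


def duplication_after_split(paralog_peak, ortholog_peak):
    """True if the paralog Ks peak is younger than the ortholog peak."""
    return paralog_peak < ortholog_peak


# Toy values (made up for illustration, not data from this study).
paralog_ks = [0.11, 0.12, 0.13, 0.14, 0.31, 0.62]
ortholog_ks = [0.27, 0.28, 0.29, 0.26, 0.45]
recent = duplication_after_split(ks_peak(paralog_ks), ks_peak(ortholog_ks))
```

A real analysis would usually fit mixture models or kernel densities rather than rely on a single modal bin, and would account for differences in substitution rate between lineages.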

Genome Sequencing and Assembly
Prior to assembly, we generated 91.20 gigabases (Gb) of Illumina paired-end reads for k-mer analysis (Table S1). We estimated the haploid genome size of P. uniflora to be 1.33 Gb, with a heterozygosity rate of 0.41% (Figure S1). Given the low heterozygosity of the P. uniflora genome, we chose to produce a collapsed assembly instead of haplotype-resolved assemblies in this study. We then generated a total of 26.77 Gb of PacBio HiFi reads (number of reads = 1,639,479; N50 = 16.14 kb) for the de novo assembly of the P. uniflora genome (Figure S2). This effort yielded 3672 contigs covering 1.68 Gb with an N50 of 2.2 Mb (Table S2). After excluding potential duplicate haplotypes and pseudo-contigs, we retained 820 contigs, totaling 1.27 Gb, for subsequent Hi-C scaffolding analysis (Table S2). Ultimately, 99.84% of the entire genome assembly was anchored onto 16 pseudochromosomes (Figure 1a), utilizing 106.49 Gb of Hi-C paired-end reads. The pseudochromosomes ranged in length from 67.28 Mb to 96.32 Mb (Table S3). This pseudochromosome number aligns with findings from a previous karyotype study of the Rosaceae family [54]. We observed an overall GC content of 41.68% for the final assembly (Figure 1b), with contig and super-scaffold N50 values of 2.77 and 79.32 Mb, respectively (Tables 1 and S4). The BUSCO completeness score for the final assembly reached 97.03%, encompassing 51.36% single-copy and 45.66% duplicated BUSCOs (Table S5). Additionally, we noted an overall LAI score of 22.86 (standard deviation = 5.15) for the entire genome, indicating a high level of completeness (Figure S3).
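For readers unfamiliar with the N50 statistic quoted above, the following minimal sketch shows the standard definition: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length. The function name and the toy lengths are illustrative and are not taken from this study.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence downward until half of the
    # total length has been accumulated.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy example: total length is 32, and the two longest contigs
# (8 + 8 = 16) already reach half of it.
example = n50([2, 2, 2, 3, 3, 4, 8, 8])
```

The same function applies to contigs and to super-scaffolds; only the input lengths differ.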

Repeat Annotation
Within the final assembly, we identified a total of 875.99 Mb of repetitive sequences, constituting 68.83% of the P. uniflora genome (Table S6). Transposable elements (TEs) constituted the majority of repeat sequences, spanning 777.28 Mb or 61.07% of the entire genome (Figure 1b). This was followed by 113.34 Mb of unclassified repeats and a smaller proportion of other repeat classes, including satellites (2.72 Mb), simple repeats (1.11 Mb), and low-complexity repeats (205 bp). LTR-RTs accounted for 94.64% (735.58 Mb) of the TE sequences, with a Gypsy/Copia ratio of 2.37. Furthermore, we identified 13,071 full-length LTR-RTs with a cumulative length of 117.78 Mb within the P. uniflora genome. This included 6295 Gypsy elements, 2039 Copia elements, and 4737 unclassified LTRs (Figure S4).
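The proportions reported above can be cross-checked with simple arithmetic. The sketch below uses only figures quoted in this article (the 1272.71 Mb assembly size comes from the abstract); the variable and function names are illustrative.

```python
ASSEMBLY_MB = 1272.71   # final assembly size (abstract)
REPEATS_MB = 875.99     # all repetitive sequences
TE_MB = 777.28          # transposable elements
LTR_RT_MB = 735.58      # LTR retrotransposons


def percent(part, whole):
    """Percentage of `whole` made up by `part`, rounded to 2 decimals."""
    return round(100.0 * part / whole, 2)


repeat_pct = percent(REPEATS_MB, ASSEMBLY_MB)   # share of genome in repeats
te_pct = percent(TE_MB, ASSEMBLY_MB)            # share of genome in TEs
ltr_pct_of_te = percent(LTR_RT_MB, TE_MB)       # share of TEs that are LTR-RTs

# Full-length LTR-RTs: Gypsy + Copia + unclassified.
full_length_ltr_total = 6295 + 2039 + 4737
```

These values reproduce the reported 68.83%, 61.07%, and 94.64%, and the three full-length LTR-RT categories sum to the reported 13,071 elements.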

Gene Prediction and Functional Annotation
We generated 27.95 Gb of RNA-seq data for transcriptome-based gene prediction (Table S7), which resulted in the identification of 69,894 gene models. These gene models were merged with predictions from homology-based and de novo approaches (Table S8), yielding a final consensus gene set of 49,261 protein-coding genes with a BUSCO completeness score of 97.34% (Table S5). The average transcript length of the predicted genes was 3816 bp, the average coding sequence (CDS) size was 1441 bp, each gene had an average of 5.3 exons, and the average intron length was 557 bp (Table 1 and Figure S5). The 16 pseudochromosomes contained 99.77% of the genes (49,148), resulting in an overall gene density of 39.7 genes per Mb (Table S3). Among these protein-coding genes, 2373 (4.82%) were identified as transcription factor (TF) genes, with the bHLH, MYB, and ERF families being the three largest TF families (Table S9). A total of 45,256 (91.87%) genes were assigned functional annotations from public databases, and 14,696 (29.83%) genes were attributed to GO terms (Table S10). Additionally, we identified 5127 (10.41%) tandemly duplicated genes distributed across 1952 arrays within the P. uniflora genome (Table S11). These genes were significantly enriched in processes related to "obsolete oxidation-reduction", "metabolism", "defense response", and "transmembrane transport" (Figure S6). Furthermore, we annotated a total of 3080 ncRNAs spanning 0.51 Mb of genome sequences, encompassing 957 transfer RNAs (tRNAs), 181 microRNAs (miRNAs), 1216 ribosomal RNAs (rRNAs), and small nuclear RNAs (snRNAs) (Table S12). Collectively, these annotated gene sets will substantially facilitate further genomic and genetic studies of P. uniflora.

Evolutionary History of P. uniflora
Our OrthoFinder analysis identified 118 single-copy orthogroups shared by P. uniflora and eight other Rosaceae species. Based on these single-copy orthogroups, a well-supported species tree was recovered, showing that P. uniflora and P. salicina clustered together with nearly full (97%) bootstrap support (Figure 2a). The divergence time between P. uniflora and P. salicina was estimated to be around 37.6 Mya, which was slightly later than the divergence between their common ancestor and the clade formed by Maleae and Gillenieae (44.9 Mya). We observed a major peak around 0.29 in the Ks distribution of orthologs between P. uniflora and P. salicina, and a younger peak around 0.12 in the paralog analysis of P. uniflora (Figure 2b). Because the paralog peak is younger than the ortholog peak, we inferred that an independent WGD event occurred in the P. uniflora genome after its split from P. salicina, which may have contributed significantly to its larger haploid genome size (1.27 Gb vs. 284 Mb) and larger number of protein-coding genes (49,261 vs. 27,481) compared to P. salicina. According to the Plant DNA C-values Database, another Exochordeae species, Exochorda giraldii, has eight haploid chromosomes and a haploid genome size smaller than 600 Mb. Therefore, this recent WGD event likely occurred in P. uniflora after its split from E. giraldii.
Notably, there are some differences between our species tree and the nuclear phylogeny presented by Xiang et al. (2017) [8]. In the previous phylogeny, P. uniflora was more closely related to the clade formed by Maleae and Gillenieae. However, our nuclear genome-based phylogeny seems more reasonable because both P. uniflora and P. salicina have chromosome numbers that are multiples of 8, while all Maleae species have 17 chromosomes and Gillenieae have 9 (Figure 2a). In the future, more high-quality genome sequences of Rosaceae species will help us further elucidate the phylogenetic relationships and evolutionary history of Rosaceae.

Conclusions
In this study, we successfully assembled a chromosome-level genome of P. uniflora using advanced PacBio HiFi sequencing and Hi-C technologies. The resultant P. uniflora assembly exhibits a commendable level of continuity and completeness, albeit with a notable presence of repetitive elements. Subsequent gene predictions conducted on this high-quality genome assembly yielded a substantial set of 49,261 protein-coding genes and 3080 non-coding RNAs (ncRNAs). A substantial portion of these protein-coding genes received functional annotations, offering valuable insights for forthcoming functional genomic inquiries pertaining to P. uniflora. Furthermore, P. uniflora and P. salicina clustered together according to comparative genomic analysis, and the P. uniflora genome had recently undergone a WGD event.
The high-quality genome assembly and meticulous annotation of P. uniflora presented herein are poised to expedite comprehensive genome-wide investigations concerning the biosynthesis of medicinal constituents in this traditional Chinese medicinal plant. Furthermore, it serves as a valuable addition to the ongoing comparative genomics endeavors within the Rosaceae family.

Significance
This study presents the first genome assembly and annotation of a Prinsepia species, thereby providing the scientific community with a valuable genomic resource for advancing research related to P. uniflora. Moreover, it serves as a valuable contribution to the broader field of comparative genomics within the Rosaceae family.

Figure 1. Characteristics of the P. uniflora genome assembly. (a) Heatmap displaying the Hi-C interactions among the 16 pseudochromosomes of the P. uniflora genome. (b) P. uniflora's genomic features are displayed in non-overlapping windows of 1 Mb, and the tracks show the following: (a) GC content; (b) repeat density; (c) TE density; (d) density of unclassified repeats; and (e) gene density.

Figure 2. Phylogenetic and evolutionary analysis of the P. uniflora genome. (a) A species tree comprising eight other Rosaceae species and P. uniflora, based on 118 single-copy orthogroups. Divergence times with 95% confidence intervals are indicated in blue, while bootstrap values are indicated in purple. Red dots represent the fossil calibration points. Chromosome numbers are positioned to the right of species names. (b) Ks distributions for the paralogs and orthologs found in the whole genomes of P. uniflora and Prunus salicina.

Table 1. Overall statistics regarding the annotation and assembly of the P. uniflora genome.
