Genome Assembly and Microsatellite Marker Development Using Illumina and PacBio Sequencing in the Carex pumila (Cyperaceae) from Korea

This study is the first to report the characterization of Carex pumila genomic information. Assembly of the genome generated a draft of C. pumila based on PacBio Sequel II and Illumina paired-end sequencing, which was assembled from 2941 contigs with an estimated genome size of 0.346 Gb. The estimate of repeats in the genome was 31.0%, and heterozygosity ranged from 0.426 to 0.441%. The integrity evaluation of the assembly revealed 1481 complete benchmarked universal single-copy orthologs (BUSCO) (91.76%), indicating the high quality of the draft assembly. A total of 23,402 protein-coding genes were successfully predicted and annotated in the protein database. UpsetR plots showed that 7481 orthogroups were shared by all species. The phylogenetic tree showed that C. pumila is a close but distant relative of Ananas comosus. C. pumila had greater contraction (3154) than expansion (392). Among the extended gene families, aquaporins have been found to be enriched. Primers for microsatellite markers determined 30 polymorphic markers out of 100. The average number of alleles amplified by these 30 polymorphic markers was 4 to 12, with an average polymorphism information content (PIC) value of 0.660. In conclusion, our study provides a useful resource for comparative genomics, phylogeny, and future population studies of C. pumila.


Introduction
The Cyperaceae family is a cosmopolitan plant family with medicinal properties, with about 5000 species distributed worldwide [1,2].The Cyperaceae family consists of medicinal plants such as Cyperus rotundus, Cyperus papyrus, and Carex baccans.The plant has been reported over the past few decades to exhibit excellent chemical versatility and traditional medicinal and anticancer, anti-inflammatory, antibacterial, and anthelmintic properties [1][2][3][4].
C. pumila is widely distributed in Australia (e.g., Lord Howe Island), New Zealand, Chile, China, Japan, and Republic of Korea and inhabits salty coastal dunes [5].Salt plants synthesize various bioactive molecules in response to survival mechanisms to survive in extreme environments [6].In recent studies, C. pumila has been reported to have potential effects in inhibiting IgE-mediated allergic reactions, and invasion and metastasis of cancer cells [5,7].To use it as an effective medicinal plant, it is important to cultivate C. pumila into cultivars.For faster growth and more medicinal enrichment, cultivar development is required, and genomic research can help develop new cultivars [8].In addition, genomic information can be used to facilitate drug discovery in medicinal plants [9].So far, four genome assemblies of the Cyperaceae family have been reported: Kobresia littledalei, Carex parvula, Carex kokanica, and Kobresia myosuroides [10][11][12][13].Currently, genomic information is continuously increasing, but it has been less compared to the number of species of Cyperaceae.No research has been reported on the genome information of C. pumila.parvula, Carex kokanica, and Kobresia myosuroides [10][11][12][13].Currently, genomic information is continuously increasing, but it has been less compared to the number of species of Cyperaceae.No research has been reported on the genome information of C. pumila.
Next-generation sequencing technology (NGS) is less expensive and more accurate than existing technologies.However, in the case of short-read sequencing, accuracy is high, but it has a challenge resolving repetitive regions.As the long-read technology develops simultaneously with the development of current technology, the problem of the repeat area is being solved [14].By complementing the second generation short-read sequencing and third generation long-read sequencing technologies, precise genome information can be obtained with higher accuracy [15].
In this study, we generated a draft genome through sequencing and assembly of C. pumila.We annotated genes, predicted protein-coding genes, determined phylogenetic positions using orthogroups, and developed microsatellite markers.This study provides an important resource for future population genetic studies and molecular breeding using the developed markers.

Sample Preparation
Samples were collected from three wild populations located in the Republic of Korea during 2022 (Figures 1 and 2) (Goseoung (GS; 38°22′01.00″N, 128°30′32.2″E), Incheon (IC; 37°14′50.2″N, 126°31′27.7″E), and Chungyang (CY: 36°19′14.27″N, 126°56′16.16″E)).Leaf samples for NGS analysis were collected from one Goseong population.The collected samples were dried using silica gel and stored in a deep freezer (−80 °C).Eight individuals from three populations were sampled for microsatellite marker polymorphism analysis.Genomic DNA was extracted from the stored samples using the DNEasy Plant Mini Kit (QIAGEN, Hilden, Germany) according to the manufacturer's protocol.The quality of the extracted genomic DNA was checked using a 1% agarose gel and a Nanodrop spectrophotometer.Total RNA was extracted from leaf tissue using the RNeasy Plant Mini Kit (QIAGEN) according to the manufacturer's protocol (Macrogen, Inc., Seoul, Republic of Korea).

Genome Sequencing and De Novo Assembly
The whole genome was sequenced using a 150 bp paired-end library produced by Macrogen (Macrogen Inc., Seoul, Republic of Korea) and using the Illumina HiSeq 2500 (Illumina, San Diego, CA, USA) platform.In addition, to solve the serial repeat problem in plants, we used single-molecule real-time sequencing from Pacific BioSciences (PacBio) following a protocol from Macrogen (Macrogen Inc., Seoul, Republic of Korea).DNA sequencing libraries were prepared and contiguous long reads (CLRs) obtained using the PacBio RS II DNA sequencer platform.Initial assembly was run with default parameter values using HGAP ver.3.0 [26] and Wtdbg2 ver.2.3 [27].Long sequences were assembled by merging similar segments into vertices and connecting reads based on segment adjacency.The Illumina short-read run with default parameters using the Trimmomatic ver.0.33 [28].Clean reads were obtained after removing adapter sequences.To improve the accuracy of initial contig assembly, BWA-MEM [29] and Pilon version 1. 23 [30] were used with default parameters.The draft genomes generated in two assemblers for the integrity of the assembly with Embryophyta odb version 10 using Benchmarking Universal Single-Copy Orthologs ver.5.2.2 [31].Of the two genomes, we used the genome from HGAP and with the best results for BUSCO and N50 values.The assembled genome has been deposited in NCBI GenBank whole-genome shotgun database under accession number JARJHM000000000.
RNA Iso-seq libraries were prepared for genome annotation and sequenced using the PacBio sequel II platform according to the manufacturer's protocol (Macrogen Inc., Seoul, Republic of Korea).Iso-seq results yielded a total of 18,257,235 subreads (Table 1).

Genome Sequencing and De Novo Assembly
The whole genome was sequenced using a 150 bp paired-end library produced by Macrogen (Macrogen Inc., Seoul, Republic of Korea) and using the Illumina HiSeq 2500 (Illumina, San Diego, CA, USA) platform.In addition, to solve the serial repeat problem in plants, we used single-molecule real-time sequencing from Pacific BioSciences (PacBio) following a protocol from Macrogen (Macrogen Inc., Seoul, Republic of Korea).DNA sequencing libraries were prepared and contiguous long reads (CLRs) obtained using the PacBio RS II DNA sequencer platform.Initial assembly was run with default parameter values using HGAP ver.3.0 [26] and Wtdbg2 ver.2.3 [27].Long sequences were assembled by merging similar segments into vertices and connecting reads based on segment adjacency.The Illumina short-read run with default parameters using the Trimmomatic ver.0.33 [28].Clean reads were obtained after removing adapter sequences.To improve the accuracy of initial contig assembly, BWA-MEM [29] and Pilon version 1. 23 [30] were used with default parameters.The draft genomes generated in two assemblers for the integrity of the assembly with Embryophyta odb version 10 using Benchmarking Universal Single-Copy Orthologs ver.5.2.2 [31].Of the two genomes, we used the genome from HGAP and with the best results for BUSCO and N50 values.The assembled genome has been deposited in NCBI GenBank whole-genome shotgun database under accession number JARJHM000000000.

Phylogenetic Tree Reconstruction
A total of 25 protein sequences derived from genomes were compared, including C. pumila, to other plant species.Protein sequences were downloaded from NCBI and used for analysis (Table S2).Protein sequences were obtained from OrthoFinder ver.2.5.4 [36] to organize the groups of orthologous genes.A matrix of the number of genes in orthogroups were calculated for each plant species and used with UpSetR [37].After selecting the significant groups with p < 0.01 from the phylogenetic tree using CAFE5 ver.5.0.0 [38], orthogroup expansion and contraction were measured using CafePlotter (https://github.com/moshi4/CafePlotter,accessed on 1 January 2022), and genes were reconstructed.

Genome Assembly of C. pumila
For whole-genome sequencing, 870,146,384 reads were generated using Illumina shortread sequencing, and the total read length was 131,392,103,984 bp.After removing adapter sequences and low-quality reads using Trimmomatic [28], the total number of clean reads was 350,433,302, and the total length of reads was 52,800,516,376 bp (Table 1).The Q20 and Q30 quality were 99.47% and 97.17%, respectively (Table 1).PacBio sequencing identified 33,136,805 subreads, with a total length of 127,339,898,260 bp and an average subread length of 3842 bp (Table 1).Genomes assembled from Wtdbg2 were not suitable because they had low numbers of contigs in contrast to low N50 values (N50: 160,342).The genome assembly using HGAP yielded 2941 contigs with a total length of 346,579,715 bp (Table 2).The length of the contigs ranged from 1,780,279 to 1345 bp (average 117,844 bp).The N50 value was 351,713 bp, and the average GC content of the C. pumila genome was 33.4% (Table 2).C. parvula and C. kokanicas had GC contents of 35.41% and 34.68%, which compared similarly with C. pumila [12].The k-mer distribution (Table S1) is shown in Figure 3 as a histogram constructed using Jellyfish software (ver.2.3.0)[32].The estimated haploid genome size is approximately 0.338 Gb, and the actual assembled genome size is 0.347 Gb.Although the estimated genome size of C. pumila appears small (C-value: 0.30; 0.58 Gb), the average genome size is similar to other Carex species in the Plant DNA C-value database (average 0.86 Gb; https://cvalues.science.kew.org,accessed on 1 January 2022).Although it falls short of the actual genome size, additional honing of long-read and short-read sequencing could improve genome size.Nonetheless, the draft genome from this study will help improve breeding and disease resistance.Complete BUSCO analysis confirmed that HGAP (91.76%) had a more complete assembly genome than Wtdbg2 (86.12%).BUSCO (Embryophyta odb10, https://busco-archive.ezlab.org/frame_plants.html,accessed on 1 January 2022) analysis identified a total of 1614 BUSCO of HGAP (Table 3).Of these, 91.76% (1481) of the Embryophyta gene sequences were complete (Table 3), which comprised 1416 single-copy BUSCO (87.73%) and 65 duplicated-copy BUSCO (4.02%).The number of fragmented BUSCO was 31 (1.92%), and the number of missing BUSCO was 102 (6.32%).C. parvula and C. kokanica, which are related species of C. pumila (91.76%), showed high integrity compared to the genomes of related species at 85.9% and 88.2%, respectively, indicating a high quality of the genome [12].

Phylogenetic Inference Orthologous Groups of C. pumila
We identified 60,498 orthogroups matching 1,418,513 genes using Orthofinder [36], including C. pumila and 25 species (Table S2).UpsetR plots showed that 7481 orthogroups were shared by all species (Figure 7).UpsetR plots showed 431 orthogroups specifically identified in C. pumila, and 765 orthogroups were shared by 25 species except C. pumila.clic compound binding"; and biological process: "metabolic process" and "cellular process", which were found to be significantly enriched (Table S3).Among them, aquaporin was also found to be abundant.
Aquaporin maintains efficient water transport in roots in a salt-stressed environment, resulting in high salinity tolerance [42].It has been reported that the salt plant (Puccinellia nuttalliana) is also rich in aquaporin [42].In addition, unlike other freshwater plants, C. pumila is presumed to be rich in aquaporins because it lives in the saline environments of sand dunes.This suggests that C. pumila has an expansion of genes related to salt tolerance, such as aquaporin, compared to other plants.The phylogenetic tree showed that C. pumila was close to A. comosus but distantly related.This is because the two species differ at the family level.We analyzed the expansion and contraction of gene families based on the data of orthogroups generated by Orthofinder (Figure 8) and specifically based on the statistically significant values of orthologroups (p < 0.01).There were significant gene family expansions (2-4005) and contractions (5-4267) among the 25 plant genomes (p < 0.01).C. pumila showed more contraction (3154) than expansion (392).
Aquaporin maintains efficient water transport in roots in a salt-stressed environment, resulting in high salinity tolerance [42].It has been reported that the salt plant (Puccinellia nuttalliana) is also rich in aquaporin [42].In addition, unlike other freshwater plants, C. pumila is presumed to be rich in aquaporins because it lives in the saline environments of sand dunes.This suggests that C. pumila has an expansion of genes related to salt tolerance, such as aquaporin, compared to other plants.

Development of Novel Microsatellite Markers for Population Study
We screened microsatellite regions in the assembled genome of C. pumila.The genome contained 62,565 microsatellite regions (Table 5).Di-nucleotides were the most common repeat motif (23,876; 38.16%), followed by tri- (23,857; 38.13%), tetra-(7617; 12.17%), penta-(3774; 6.03%), and hexa-(3441; 5.50%) nucleotides (Figure S1).AT repeats motif (62.44%) were most frequently distributed in di-nucleotides.AT/TA repeat motifs (62.44%) were most frequently distributed in di-nucleotides.It has also been observed in Matthiola incana [43], rice [44], Fagopyrum tataricum [45], and other species.This may be the reason for the higher number of slips in shorter repetitions [46].In the tri-nucleotide, AAT/ATT repeat motifs accounted for 49.22%.Tri-nucleotides also showed a distribution similar to that of dinucleotides, which may differ depending on the parameter value of microsatellite screening.Abundant repeats of AT in monocotyledonous plants are reported to be uncommon [47].In C. pumila, the low GC contents are thought to be due to the low GC contents of the nucleic acids in the presence of the high frequency of A and T present in the genome.

Development of Novel Microsatellite Markers for Population Study
We screened microsatellite regions in the assembled genome of C. pumila.The genome contained 62,565 microsatellite regions (Table 5).Di-nucleotides were the most common repeat motif (23,876; 38.16%), followed by tri-(23,857; 38.13%), tetra-(7617; 12.17%), penta-(3774; 6.03%), and hexa-(3441; 5.50%) nucleotides (Figure S1).AT repeats motif (62.44%) were most frequently distributed in di-nucleotides.AT/TA repeat motifs (62.44%) were most frequently distributed in di-nucleotides.It has also been observed in Matthiola incana [43], rice [44], Fagopyrum tataricum [45], and other species.This may be the reason for the higher number of slips in shorter repetitions [46].In the tri-nucleotide, AAT/ATT repeat motifs accounted for 49.22%.Tri-nucleotides also showed a distribution similar to that of di-nucleotides, which may differ depending on the parameter value of microsatellite screening.Abundant repeats of AT in monocotyledonous plants are reported to be uncommon [47].In C. pumila, the low GC contents are thought to be due to the low GC contents of the nucleic acids in the presence of the high frequency of A and T present in the genome.Of the 62,565 identified microsatellites, 89.11% (55,696) were suitable for developing markers.We randomly selected 100 primer sets, of which only 50 (50.0%)primer sets were successfully amplified.Of these, 30 markers (60.0%) were polymorphic with a PIC value of 0.3 or higher.Finally, we selected 30 microsatellite markers amplifying various repeat motifs for the population study (Table 6).These 30 markers have been deposited to NCBI under the accession numbers OQ685862-OQ685891 (Table 6).

Figure 1 .
Figure 1.One photograph of specimens of C. pumila used in the study.

Figure 1 .
Figure 1.One photograph of specimens of C. pumila used in the study.

Figure 3 .
Figure 3. Estimation of genome size using k-mer peaks of C. pumila.

Figure 3 .
Figure 3. Estimation of genome size using k-mer peaks of C. pumila.

Figure 4 .
Figure 4. Gene ontology term annotation categories of level 3 for C. pumila: molecular functions, cellular components, and biological processes.

Figure 4 .
Figure 4. Gene ontology term annotation categories of level 3 for C. pumila: molecular functions, cellular components, and biological processes.

Figure 5 .
Figure 5. EggNOG functional classification information in C. pumila.Genes were assigned to 24 categories, excluding the "extracellular structure" category.The wave pattern in the diagram is omitted because the size of the numbers is too large.

Figure 6 .
Figure 6.Percentage of KEGG ontology (KO) terms annotated in the C. pumila genetic dataset.Genes annotated in the Kyoto Encyclopedia of Genes and Genomes (KEGG) database are grouped into major functional categories based on the annotated pathways.

Figure 5 .
Figure 5. EggNOG functional classification information in C. pumila.Genes were assigned to 24 categories, excluding the "extracellular structure" category.The wave pattern in the diagram is omitted because the size of the numbers is too large.

Genes 2023 , 15 Figure 5 .
Figure 5. EggNOG functional classification information in C. pumila.Genes were assigned to 24 categories, excluding the "extracellular structure" category.The wave pattern in the diagram is omitted because the size of the numbers is too large.

Figure 6 .
Figure 6.Percentage of KEGG ontology (KO) terms annotated in the C. pumila genetic dataset.Genes annotated in the Kyoto Encyclopedia of Genes and Genomes (KEGG) database are grouped into major functional categories based on the annotated pathways.

Figure 6 .
Figure 6.Percentage of KEGG ontology (KO) terms annotated in the C. pumila genetic dataset.Genes annotated in the Kyoto Encyclopedia of Genes and Genomes (KEGG) database are grouped into major functional categories based on the annotated pathways.

Figure 7 .
Figure 7. Orthogonal clusters of proteins from 25 reference genomes displayed as UpSet plots.Straight lines connected at each intersection point represent orthogonal groups that are shared.Blue bar: Orthogonal cluster for Carex pumila, red bar: Orthogonal cluster for all species.

Figure 7 .
Figure 7. Orthogonal clusters of proteins from 25 reference genomes displayed as UpSet plots.Straight lines connected at each intersection point represent orthogonal groups that are shared.Blue bar: Orthogonal cluster for Carex pumila, red bar: Orthogonal cluster for all species.

Figure 8 .
Figure 8.A phylogenetic tree of 25 species including Bromeliaceae, Brassicaceae, Poaceae, and Cyperaceae plants.Using CAFE5, gene family expansions (+) and contractions (−) were calculated at each ancestral node and divergence and species.

Table 1 .
Summary information of raw data for long-and short-read sequencing in C. pumila.

Table 1 .
Summary information of raw data for long-and short-read sequencing in C. pumila.

Table 2 .
Summary information of genome assembly of C. pumila.

Table 3 .
Benchmarking universal single-copy orthologs analysis results of the two assembly methods.

Table 3 .
Benchmarking universal single-copy orthologs analysis results of the two assembly methods.

Table 4 .
Summary of protein gene predictions and annotations for C. pumila.

Table 4 .
Summary of protein gene predictions and annotations for C. pumila.

Table 5 .
Microsatellite information screened from the C. pumila genome sequence.

Table 5 .
Microsatellite information screened from the C. pumila genome sequence.

Table 6 .
Characteristic and diversity information for the 30 microsatellite loci of C. pumila.

Table 6 .
Cont. : number of samples, N A : number of alleles, H O : observed, H E : expected heterozygosity, PIC: polymorphism information content. N