De Novo Genome Assembly of the Whitespot Parrotfish (Scarus forsteni): A Valuable Scaridae Genomic Resource

Scarus forsteni, the whitespot parrotfish of the family Scaridae, is a herbivorous fish inhabiting coral reef ecosystems. The deterioration of coral reefs has severely affected parrotfish habitats. The decline in the genetic diversity of parrotfish underscores the importance of conserving their genetic variability to ensure the resilience and sustainability of marine ecosystems for future generations. In this study, the genome of S. forsteni was assembled de novo using Illumina and Nanopore sequencing. The 1.71-Gb genome of S. forsteni was assembled into 544 contigs (assembly level: contig), with an N50 length of 17.97 Mb and a GC content of 39.32%. Our BUSCO analysis indicated that 98.10% of the core genes were complete in the S. forsteni assembly. Combined with structure annotation data, 34,140 (74.81%) of the 45,638 predicted protein-coding genes were functionally annotated. A comparison of genome size and TE content across teleost fishes showed a roughly linear relationship between these two parameters. However, TE content is not a decisive factor in determining the genome size of S. forsteni. Population history analysis indicated that S. forsteni experienced two major population expansions, both of which occurred before the last interglacial period. In addition, a comparative genomic analysis of evolutionary relationships showed that S. forsteni was most closely related to Cheilinus undulatus, another member of the Labridae family. Our analysis of gene family expansion and contraction showed that the expanded genes were mainly associated with immune diseases, organismal systems, and cellular processes. Pathways related to cell transcription and translation, sex hormone regulation, and other processes were also prominent among the positively selected genes. The genomic sequence of S. forsteni offers a valuable resource for future investigations on the conservation, evolution, and behavior of fish species.


Introduction
Scarus forsteni, commonly known as the whitespot parrotfish, is a herbivorous species in the family Scaridae, native to tropical areas in the western Pacific Ocean [1]. The high species richness of the Scaridae family, emblematic of reef herbivores, is likely driven by the diversifying pressures of the complex reef ecosystem [2]. The Scaridae family is classified into 10 genera, comprising a total of 90 identified species [3]. Within this family, the genus Scarus is the largest one, encompassing about 50 species [4]. Scaridae originated in the late Miocene, around 10 million years ago (Mya), and subsequently underwent a rapid evolutionary split into three major clades [5,6,7]. This diversification continued at a swift pace during the Pleistocene [8,9]. This species diversity, resulting from rapid differentiation, has made parrotfish famous for their bright colors, unique oral morphology, and critical role in coral reef ecosystems. The moniker "parrotfish" is derived from their unique beaklike structure, reminiscent of a parrot's beak, which is robust enough to grind hard coral and rock in their quest for sustenance [10]. Research has indicated that parrotfishes play a role in excavating or scraping algae from coral substrates. Corals and algae share similar ecological niches, and the excavation of algae can increase corals' light exposure [11]. The ingested material is processed in the pharynx, where three dentate pharyngeal bones, located at the back of the throat, work in conjunction with a special type of monophyodont dentine unique to parrotfish (characterized by pulp filled with alveolar bone), effectively grinding the food into a fine paste [12,13]. This mechanical breakdown promotes the digestion of algal nutrients. It is worth noting that parrotfishes lack a stomach, necessitating the thorough mastication of all ingested substances. After this, fine particles of food are more efficiently decomposed by digestive enzymes. This results in more nutrients being absorbed through the intestinal mucosa, while unabsorbed coral fragments are expelled in the form of white calcium particles [12,14].
Parrotfish, similar to most tropical reef fishes, exhibit hermaphroditism. Typically, hermaphroditic sexual transitions in fish involve substantial and often irreversible alterations in gonadal anatomy and function, accompanied by changes in coloration and behavior. In the case of parrotfish, there exist two or more male morphotypes exhibiting distinct reproductive behaviors, often correlated with variations in gonadal structures and external body morphology [15,16]. Research suggests that this unique reproductive trait may be linked to the coral reefs in which parrotfishes live. They often form small groups in coral reef areas, and some species exhibit territorial behaviors [17,18]. An expanding body of research indicates that habitats exert a substantial influence on speciation patterns via sexual selection. In this context, coral reef environments are likely to offer enhanced opportunities for territorial and mating behaviors, thereby intensifying the process of sexual selection [2]. Parrotfish also play a critical role in the health of coral reef ecosystems, especially in the context of current ocean acidification [19]. By grazing on algae that vie for space with corals and by clearing away the overgrowth, they create the open areas essential for coral growth. Additionally, their waste, rich in nutrients, helps sustain the healthy development of corals [11]. The well-being of live corals, serving as a haven for marine fish, is crucial for preserving biodiversity and ensuring ecosystem stability [18,20].
Parrotfish face a significant challenge: they cannot be bred through aquaculture, making their population in the wild particularly precious. Unfortunately, due to their vibrant colors and unique taste, parrotfish have become a high-value target for commercial fisheries. Overfishing, especially for consumption and ornamental purposes, poses a severe threat to these valuable fish species. Together with coral reef degradation, it could also lead to a decline in the genetic diversity of some parrotfish [21,22]. The conservation of these fish and their habitats is imperative for the preservation of the equilibrium and biodiversity within coral reef ecosystems [18,23]. Owing to their distinct biological characteristics and vital roles in ecological processes, parrotfish have emerged as critical subjects in marine biology and environmental protection research [24]. The comprehensive data provided by whole genome sequencing significantly enhance genetic improvement programs across various aquaculture species. Assembling a de novo whole genome is particularly effective in addressing genetic and evolutionary knowledge gaps in non-model organisms [25,26]. Particularly for the genus Scarus, this study presents the first key reference genome within the Scaridae family. This reference genome will serve as an invaluable tool for the discovery and genotyping of Single-Nucleotide Polymorphisms (SNPs) from RAD-seq data in closely related species [27,28] in population genomics approaches. In this context, genomic research on parrotfish enhances the conservation of genetic diversity, facilitates the identification of functional genes, and provides insights into evolutionary trends [9,24]. Additionally, previous studies on parrotfish have predominantly focused on mitochondrial genome analysis, population distribution, regional hybridization, and otolith characteristics, with a notable absence of comprehensive genome-wide analysis data [5,7,18,29,30]. The development of new genomic resources is crucial for addressing the current limitations in genetic research on S. forsteni. High-throughput whole genome sequencing allows the basic genetic blueprint of the species to be mapped accurately.
Genes 2024, 15, 249
In this study, our primary objective was to assemble the genome of the whitespot parrotfish, S. forsteni, de novo using both Illumina and Nanopore technologies (i.e., short- and long-read sequencing). Our aim was to generate good-quality data for assembly and annotation to enhance the understanding of the genetic, ecological, evolutionary, and developmental aspects of marine fish, particularly parrotfish. We sought to investigate the relationship between transposable element (TE) content and genome size in parrotfish and its evolutionary significance. Additionally, we aimed to utilize a comparative genomic approach to identify conserved homologous genes in parrotfish and conduct enrichment analysis to gain insights into genes and regulatory elements potentially associated with the species' dietary traits and digestive system. Overall, our research was designed to augment the pool of good-quality marine fish genomic resources, thereby providing invaluable assets for functional genomic studies and conservation efforts related to parrotfish.

Sample Collection and DNA Sequencing
A single adult female Scarus forsteni sample, obtained from Xincun Port in Hainan Province, China [31], was selected for genome sequencing and assembly. Muscle tissue was used for genomic DNA (gDNA) extraction, library construction, and sequencing. High-quality DNA was extracted from fresh muscle tissues using the DNeasy Blood & Tissue Kit (Qiagen, Hilden, Germany). DNA quality and quantity were evaluated using electrophoresis on a 1% agarose gel and using the Quant-iT™ PicoGreen® dsDNA Reagent and Kits (Thermo Fisher Scientific, Waltham, MA, USA). Then, the DNA was sequenced using two libraries, one for short reads and one for long reads. The short-read library was constructed by fragmenting DNA into 300-500 bp fragments using the Covaris 2000 ultrasonicator, followed by purification, end repair, and PCR amplification. The resulting DNA library was then sequenced on the HiSeq 2500 system in 250 bp PE mode or 100 bp PE mode (Illumina Inc., San Diego, CA, USA). For long-read sequencing, the Blue Pippin System (Sage Science, USA) was used to select long fragments with a target size of approximately 20 kb in preparation for Nanopore sequencing. These fragments were prepared following the 1D Ligation Sequencing Kit (SQK-LSK109) protocol (Oxford Nanopore, Oxford, UK), with sequencing adapters attached. The genomic DNA libraries, about 20 kb in size, were quantified with the Qubit 3.0 Fluorometer (Invitrogen, Camarillo, USA). The libraries were sequenced on a single flow cell of the PromethION DNA sequencer (Oxford Nanopore, UK) following the manufacturer's guidelines.

Sequence Data Processing and Genome Survey
The quality of the raw genomic library data from Illumina was assessed using FastQC v.0.11.9 [32,33]. Then, SOAPnuke v2.X [34] was used to trim the reads and exclude low-quality reads, with low-quality reads being defined by the parameters -l 5 -q 0.3 -n 0.05 -E 50. The trimmed reads processed by SOAPnuke v2.X were utilized for genome size estimation using GCE v1.0.2 software [35]. The frequencies of 17-mers were counted using GCE v1.0.2 [35] (https://github.com/fanagislab/GCE/tree/master/gce-alternative, accessed on 20 February 2023), and the heterozygosity and repetitiveness rates of the S. forsteni genome were also predicted. The genome size was calculated as follows: genome size = k-mer number/peak depth. Heterozygosity and repetitiveness were calculated using the built-in scripts of the software program.
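The genome size formula above can be illustrated with a minimal Python sketch. It is a simplification and not the GCE implementation: the function name, the `min_depth` error cutoff, and the toy histogram are assumptions made for illustration, and the main peak is simply taken as the depth with the most distinct k-mers.

```python
def estimate_genome_size(histogram, min_depth=5):
    """Estimate genome size as (k-mer number) / (peak depth).

    histogram maps a k-mer depth to the number of distinct k-mers
    observed at that depth. min_depth is a hypothetical cutoff that
    drops low-depth k-mers, which mostly come from sequencing errors.
    """
    usable = {d: c for d, c in histogram.items() if d >= min_depth}
    if not usable:
        raise ValueError("no k-mers at or above min_depth")
    # Main peak: the depth shared by the largest number of distinct k-mers.
    peak_depth = max(usable, key=usable.get)
    # Total k-mer number: every distinct k-mer counted once per occurrence.
    kmer_number = sum(d * c for d, c in usable.items())
    return kmer_number / peak_depth, peak_depth


# Toy histogram (made-up values, not data from this study).
toy_histogram = {1: 500, 9: 10, 10: 80, 11: 10}
toy_size, toy_peak = estimate_genome_size(toy_histogram)
print(toy_size, toy_peak)
```

On a real 17-mer histogram with separate heterozygous and homozygous peaks, the homozygous peak would have to be chosen explicitly, since the tallest peak is not guaranteed to be the homozygous one.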

De Novo Genome Assembly and Quality Assessment
Long-read sequencing was carried out for S. forsteni genome assembly, and errors in Nanopore clean reads were corrected using NextDenovo (https://github.com/Nextomics/NextDenovo, accessed on 15 March 2023). The seed threshold was set to 30 k, and the assembly was carried out using the default parameters of the software. Thereafter, the genome assembled via Nanopore technology was refined through two stages of correction using NextPolish [36], which involved the application of both error-corrected long reads and Illumina DNA short reads. To assess the percentage of the original reads represented in the genome assembly, the quality of the S. forsteni assembly was evaluated by mapping Illumina reads back to the assembly using BWA-MEM v2 [37]. Finally, the integrity of the genome was assessed using Benchmarking Universal Single-Copy Orthologs (BUSCO) v5.4.4 [38], evaluating 3640 core genes from the "actinopterygii_odb10" dataset, which is built from 26 fish species (options -m genome).

Gene Prediction and Annotation
Prior to genome annotation, a repeat sequence library was constructed using RepeatModeler (v 2.0.3) [39]. This library was then searched with RepeatMasker (v 4.1.2) [40] to predict repeat sequences in the S. forsteni genome. Following this, any overlapping or redundant elements in the predictions from various methods were eliminated. This streamlined process was then used to pinpoint and mask the repetitive sequences in the genome. The repeat-masked genome sequence was then used for subsequent gene annotation. In this study, three strategies were employed to predict the protein-coding genes in the assembled genome, with each chosen strategy being designed to complement the others and enhance the overall accuracy and comprehensiveness of our genomic annotation. First, the genome was annotated de novo using the self-training mode of the Augustus software (v 3.4.0) [41]. The annotations were then processed using the SNAP_to_GFF3.pl and augustus_GTF_to_EVM_GFF3.pl scripts from the EvidenceModeler (EVM, v 1.1.1) [42] package. Second, protein-coding genes were predicted based on transcriptome data using the TransDecoder-v5.7.0 software (https://github.com/TransDecoder/TransDecoder, accessed on 25 April 2023), and RNA sequencing (RNA-seq) data were obtained from the unpublished data of our research group. Third, protein sequences of Labrus bergylta, Cheilinus undulatus, Notolabrus celidotus, Sparus aurata, and Acanthopagrus latus were downloaded from the NCBI database (https://www.ncbi.nlm.nih.gov, accessed on 25 April 2023) for homology prediction with the parrotfish genome. Genewise [43] was then utilized to analyze these alignments, aiming to accurately determine spliced alignments. Finally, EVM was used to integrate the protein-coding gene predictions to form a consensus gene set.
The functional annotation of predicted protein-coding genes was performed using BLAST (blastp) with an e-value of 1 × 10⁻⁵ against several databases [44]: the SWISS-PROT database [45], the NCBI non-redundant protein (NR) database, the Translated EMBL (TrEMBL) database, the Clusters of Orthologous Genes (COG) database, and the Kyoto Encyclopedia of Genes and Genomes (KEGG) database [46]. Additionally, protein motifs and domains were annotated by searching the InterPro and Gene Ontology (GO) databases using InterProScan (v.4.8) [47]. Infernal v1.1.2 software was used to predict non-coding RNA genes [48].

Population History Analysis
To gain a deeper understanding of the population dynamics and history of S. forsteni, we employed the Pairwise Sequentially Markovian Coalescent (PSMC v0.6.5) model to infer its population size history from a diploid sequence [56]. This analysis helped reveal the species' adaptations and evolutionary processes in response to past environmental changes and ecological pressures, providing valuable insights into its population history. To generate the whole genome diploid consensus sequence, clean data were mapped to the S. forsteni assembled genome using BWA-MEM v2 [37]. SAMtools v1.13 [57] was then employed to produce the diploid consensus following the PSMC documentation (https://github.com/lh3/psmc, accessed on 25 October 2023), with the parameters set to "-d 10 -D 100". PSMC model estimates were run with the options -N25 -t15 -r5 -p "4+25*2+4+6", assuming a generation time of 2 years for S. forsteni and a substitution rate of 2.0 × 10⁻⁹ per site per year [56,58].
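The -p pattern used above can be unpacked with a short Python sketch. This is an illustrative helper, not part of PSMC: the function names are hypothetical, and the conversion of the per-year rate to a per-generation rate is shown only as an assumption about how the two reported values could be combined when scaling the output.

```python
def parse_psmc_pattern(pattern):
    """Expand a PSMC -p pattern into the width of each free parameter.

    In a term "n*w", n consecutive parameters each span w atomic time
    intervals; a bare term "w" is one parameter spanning w intervals.
    """
    widths = []
    for term in pattern.replace(" ", "").split("+"):
        if "*" in term:
            repeats, width = (int(x) for x in term.split("*"))
        else:
            repeats, width = 1, int(term)
        widths.extend([width] * repeats)
    return widths


def per_generation_rate(rate_per_year, generation_time_years):
    """Convert a per-site, per-year rate to a per-generation rate."""
    return rate_per_year * generation_time_years


widths = parse_psmc_pattern("4+25*2+4+6")
print(len(widths), "free parameters over", sum(widths), "atomic intervals")

# Values reported in the text: 2.0e-9 per site per year, 2-year generations.
mu_generation = per_generation_rate(2.0e-9, 2)
print(mu_generation)
```

Grouping intervals this way ties neighbouring time windows to a single population size parameter, which reduces the number of free parameters where few coalescent events are expected.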

Expansion and Contraction of Gene Families
Gene family expansions, involving increased gene copies, and contractions, with reduced gene copies, were central to our study. We conducted an expansion and contraction analysis of homologous gene families across 15 species to understand their evolutionary dynamics and functional implications. Our main goal was to explore how gene families evolved within these species and uncover potential adaptations or functional changes in an evolutionary context. Gene family expansions and contractions were examined using CAFE v5.0 [63] based on orthologs identified from gene family clustering and a phylogenetic analysis of single-copy orthologous genes. Variations in gene families along each lineage of the phylogenetic tree were predicted using a random birth and death model. The significance of gene family changes was assessed by comparing conditional likelihoods from a probabilistic graphical model using p values. GO term and KEGG pathway enrichment analyses were conducted based on gene families specifically expanded (p < 0.05) in S. forsteni using clusterProfiler [64].

Screening of Positive Selection Genes
To elucidate the mechanisms underlying the adaptive evolution of S. forsteni and to identify specific genes that may have played a crucial role in its unique adaptations, we conducted a detailed analysis of selective pressures using single-copy orthologous genes from 15 species. The ratio of non-synonymous substitution (Ka) to synonymous substitution (Ks), namely the ω value (Ka/Ks), is commonly employed to represent the selection pressure in sequence analysis. In this research, PAML v.4.9 software [62] and its CODEML tool were used to estimate the Ka/Ks (ω) values. The branch-site model was chosen due to its better alignment with the actual scenarios of intricate species divergence processes. For each gene, CODEML was used to calculate the likelihood value [65], and this value was further utilized to compute the likelihood ratio statistic. Subsequently, a chi-square test was applied to obtain the corresponding p-value. To control the false discovery rate (FDR) across the entire genome, all p-values were adjusted accordingly. Genes with FDR values less than 0.05 were selected as the final candidate positively selected genes [66]. Next, eggnog-mapper v2.1 [67] was used to further annotate the resulting positively selected genes. Finally, using clusterProfiler [64], GO term and KEGG pathway enrichment analyses were performed on the genes that showed positive selection (p < 0.05) in S. forsteni.
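The testing steps described above (likelihood ratio statistic, chi-square p-value, FDR adjustment) can be sketched in Python as follows. Two choices here are assumptions rather than details stated in the text: the chi-square test is taken to have one degree of freedom, and the FDR adjustment is taken to be the Benjamini-Hochberg procedure. The function names and the example log-likelihoods are hypothetical.

```python
import math


def lrt_pvalue(lnl_null, lnl_alt):
    """Likelihood ratio test between a null and an alternative model.

    Returns the statistic 2 * (lnL_alt - lnL_null) and its p-value under
    a chi-square distribution with 1 degree of freedom (an assumption).
    """
    stat = max(0.0, 2.0 * (lnl_alt - lnl_null))
    # Survival function of chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    return stat, math.erfc(math.sqrt(stat / 2.0))


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up log-likelihoods for one gene (not results from this study).
stat, p = lrt_pvalue(lnl_null=-1000.0, lnl_alt=-998.0)
print(stat, p)

fdr = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print(fdr)
```

Genes would then be retained when their adjusted value falls below 0.05, matching the threshold given in the text.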

Genome Assembly and Quality Assessment
Our sequencing project generated approximately 123.04 Gb of raw Illumina paired-end reads. These were processed using SOAPnuke v2.X software to remove low-quality reads and adapter sequences, resulting in ~109 Gb of clean data. This high-quality dataset was then utilized for genome size estimation. GCE v1.0.2 was utilized to generate a histogram of the sequencing depth distribution (k = 17) (Figure 1). The histogram showed a main k-mer coverage peak at approximately 65× coverage depth, indicative of a homozygous peak. Additionally, a heterozygous peak at around 30× coverage depth and a peak corresponding to repetitive sequences between 100× and 150× coverage depth were observed. As estimated by the GCE v1.0.2 software, the genome size was approximately 1.61 Gb. A script-based calculation revealed a heterozygosity rate of 0.78%, classifying it as a genome with low heterozygosity. Utilizing Oxford Nanopore sequencing technology, we obtained 169 Gb of raw data. These raw data were then subjected to genome assembly using NextDenovo software V2.1-beta.0 with the default parameters set for raw data input. Subsequently, adapter sequences were removed and errors corrected using NextPolish, resulting in an optimized draft genome of the parrotfish (assembly level: contig) with a total size of 1.71 Gb. The assembly comprised 544 contigs, with the largest contig being 58.57 Mb. The N50 length was 17.97 Mb, and the GC content was 39.32% (Table 1). This species' genome is larger than the known genomes of other marine fishes. BUSCO v5.4.4 analysis using the actinopterygii_odb10 core gene set showed that the assembled genome contained 98.10% complete and 0.58% fragmented genes, with 1.32% of the genes missing (Table 2). Moreover, 1.57% of the genes were identified as duplicates. When the Illumina DNA short reads were mapped back to the assembly using BWA-MEM v2 software, 95.87% of the reads were correctly matched. These results indicated that our genome had a high assembly efficiency and integrity.
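For readers less familiar with the N50 statistic reported here, the following Python sketch shows how it is commonly computed from a list of contig lengths. The function name and the toy lengths are illustrative only and are not taken from the assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the length L such that contigs of length >= L together
    cover at least half of the total assembly length.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first contig that brings the running sum to half.
        if running * 2 >= total:
            return length
    raise ValueError("cannot compute N50 of an empty assembly")


# Toy contig lengths (made-up values).
toy_contigs = [2, 3, 4, 5, 6]
print(n50(toy_contigs))
```

Because the statistic is driven by the longest sequences, it describes contiguity rather than the average contig length.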

Genome Annotation
De novo and homology-based annotations were used together with transcriptome data to predict gene structure, resulting in 45,638 predicted protein-coding genes. Using BLAST (blastp) for functional annotation against public databases such as COG, KEGG, NR, SWISS-PROT, and TrEMBL, 34,140 of the predicted protein-coding genes were functionally annotated, accounting for 74.81% of the predicted genes. Among them, 54.61% of the genes had matches in SWISS-PROT, and more than 70% of the genes were annotated in NR and TrEMBL. A total of 8763 genes could be annotated in all databases (Figure 2, Table S1). In addition, our genomic analysis predicted the presence of non-protein-coding genes, namely 251 rRNA genes, 2028 tRNA genes, 473 small-nuclear RNA (snRNA) genes, and 350 microRNA (miRNA) genes (Table 3).

Repetitive Element Annotation
Repeat sequence elements of the S. forsteni genome were annotated using RepeatMasker (v 4.1.2) and RepeatModeler2. Repetitive elements contributed 880.8 Mb of the assembly and accounted for ~51.4% of the genome length (Table 4). In the S. forsteni genome assembly, class I TEs (RNA transposons or retrotransposons) comprised approximately 9.74% of the genome. The predominant retrotransposons identified were Long Interspersed Nuclear Elements (LINEs), accounting for 70.63% of all class I transposons detected. Additionally, the genome of S. forsteni was found to be abundant in class II TEs (DNA transposons), constituting nearly 23.04% of the total genome. In a comparative analysis of TE content across S. forsteni and seven other species, a positive, approximately linear correlation between TE content and genome size was observed (p value: 0.0825) (Figure 3A, Table S2). For other fish species that evolved later, such as S. forsteni, C. undulatus, and A. ocellaris, the content of TEs was found to be similar (Figure 3B).

Population History of S. forsteni
In order to gain insights into the historical population dynamics of S. forsteni, this study employed a Pairwise Sequentially Markovian Coalescent (PSMC) analysis. This method infers past effective population sizes over different time points by assessing the distribution of coalescent times across various segments of a single individual's genome sequence. PSMC analysis is widely regarded as an effective tool for studying population history and dynamics, being particularly suitable for species where extensive historical samples are not readily available. The effective population size of S. forsteni between 10 Ka and 100 Ma is shown in Figure 4. The visualizations show that between 10 Ma and 4 Ma, there was a marked expansion in population size, reaching a peak at 4 Ma. This was followed by a slight decline in population size, which then began to expand again around 1000 Ka, reaching a second and higher peak at 700 Ka. Subsequently, the population size showed a stepwise decline until 100 Ka.

Genome Evolution Analysis
To investigate the phylogenetic relationship of S. forsteni with 14 other species (O. niloticus, O. latipes, L. crocea, C. rostratus, P. leopardus, A. centrarchus, L. bergylta, C. undulatus, N. celidotus, D. rerio, A. polyacanthus, A. ocellaris, S. partitus, and H. sapiens), we conducted a comparative genomics analysis. The results show that the protein-coding genes of the 15 species were clustered into 3234 gene families. Among these gene families, 1196 were single-copy gene families (Figure 5), and 42,555 genes from 30,862 orthogroups were identified in S. forsteni. Upon analyzing the data for single-copy orthologs, multi-copy orthologs, unique paralogs, other orthologs, and unclustered genes across the 15 species, it was found that S. forsteni, similar to both zebrafish and humans, possesses a notably high number of genes. Next, we utilized the coding sequences of the 1196 single-copy homologous genes to construct a phylogenetic tree and determine divergence times (Figure 6). According to our phylogenetic analysis, S. forsteni and C. undulatus grouped with N. celidotus and L. bergylta, which are members of the Labridae family. It was determined that S. forsteni and C. undulatus diverged from a common ancestor approximately 45.2 Mya.

Expansion and Contraction of Gene Families
To investigate the adaptive evolution of S. forsteni, we estimated the expansion and contraction of the gene families. A total of 391 genes were identified as expanded in the S. forsteni genome, and 294 were identified as contracted. GO term and KEGG pathway enrichment analyses were performed for 197 significantly expanded genes (p < 0.05) (Tables S3 and S4) and 19 significantly contracted genes (p < 0.05) (Tables S5 and S6). The top 20 pathways from the KEGG enrichment analysis revealed that the expanded genes were primarily enriched in categories such as human diseases, organismal systems, and cellular processes. Specifically, this was manifested in pathways associated with prion disease, Parkinson's disease, lipid and atherosclerosis, immune system, gap junction, mineral absorption, and phagosomes. Our GO term analysis of the expanded genes showed a higher number of genes in terms related to processes such as the negative regulation of supramolecular fiber organization, cerebellum development, and metencephalon development (Figure 7). In addition, the contracted gene families were mainly enriched in olfactory transduction, the neurotrophin signaling pathway, and the regulation of apoptotic cell clearance (Figure 8).

Analysis of Positive Selection Genes
According to the branch-site of S. forsteni in the phylogenetic tree, we identified that 206 genes were subjected to significantly positive selection (p < 0.05).Our KEGG enrichment analysis indicated significant involvement in Brite Hierarchies such as transcription machinery, translation factors, and the spliceosome.Additionally, there was notable representation in the signaling pathways regulating the pluripotency of stem cells, tight junctions, and the estrogen signaling pathway.The regulation of post-embryonic development, oocyte development, polytene chromosome development, and various complexes were predominantly featured in the GO term analysis (Figure 9; Tables S7 and S8).machinery, translation factors, and the spliceosome.Additionally, there was notable representation in the signaling pathways regulating the pluripotency of stem cells, tight junctions, and the estrogen signaling pathway.The regulation of post-embryonic development, oocyte development, polytene chromosome development, and various complexes were predominantly featured in the GO term analysis (Figure 9; Tables S7 and S8).
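Branch-site tests of positive selection are commonly evaluated with a likelihood ratio test (LRT) that compares a null model against an alternative model allowing positive selection on the foreground branch. This section does not describe the exact procedure, so the following is only a minimal sketch under that assumption, using a chi-square distribution with one degree of freedom. The function name and the log-likelihood inputs are hypothetical.

```python
from math import erfc, sqrt


def branch_site_lrt_p(lnl_null, lnl_alt):
    """Return (LRT statistic, p-value) for two nested model fits.

    lnl_null: log-likelihood of the null model (no positive selection)
    lnl_alt:  log-likelihood of the alternative model
    The p-value uses a chi-square distribution with 1 degree of freedom,
    whose survival function is erfc(sqrt(x / 2)).
    """
    # Numerical noise can make the statistic slightly negative; clamp at 0.
    stat = max(0.0, 2.0 * (lnl_alt - lnl_null))
    return stat, erfc(sqrt(stat / 2.0))
```

A gene would be called positively selected at the 0.05 level when the returned p-value falls below that threshold, usually after correction for testing many genes.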

Genome Assembly and Annotation
In this study, the genome size of the whitespot parrotfish was estimated to be about 1.61 Gb according to our k-mer analysis. This analysis not only provided an estimate of genome size but also revealed insights into the genome's complexity, repetitive elements, error rates, polymorphism levels, and correlation with GC content [35]. However, the de novo assembly yielded a genome size of 1.71 Gb. This difference in genome size may be due to variation in the quality and depth of sequencing data or in assembly strategies [68,69]. We also found that the whitespot parrotfish exhibited higher heterozygosity than C. undulatus [70]. Our k-mer analysis revealed the presence of highly repetitive and heterozygous regions in the parrotfish genome. This may explain why the genome size obtained from de novo assembly (1.71 Gb) was larger than that obtained from the k-mer analysis. These features may have implications for the actual genome assembly process [71]. GC content is an important sequence feature affecting the randomness of genome sequencing [72]. The GC content of the whitespot parrotfish is 39.32%, which is similar to that of species with relatively smaller genome sizes, such as Symphodus melops (41.8%), Semicossyphus pulcher (41.65%), Labroides dimidiatus (40.8%), and Labrus mixtus (41%, https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_963584025.1/, accessed on 25 October 2023) [73,74,75]. In certain fish species, a negative correlation has been observed between GC content and genome size, indicating that species with larger genomes tend to have lower GC contents [70,76]. However, this trend was not universally observed across all species. Moreover, N50 is the sequence length at which sequences of that length or longer contain at least half of the assembled bases; a higher N50 value usually means that the assembly contains longer sequences, which can be seen as a sign of good assembly quality [77]. The whitespot parrotfish genome had an N50 length of 17.97 Mb, indicating a good-quality assembly. BUSCO estimates the completeness and redundancy of processed genomic data based on universal single-copy orthologs [38]. In the BUSCO analysis, a comparison against the Actinopterygii_odb10 core gene set revealed that 3571 out of 3640 genes were present in the S. forsteni genome, achieving a completeness of 98.10%. This was a relatively good result for a contig-level genome and was slightly higher than the 96.36% completeness observed in C. undulatus [70]. Additionally, comparing annotation statistics from genomes closely related to our subject serves as an indirect indicator of the assembly's quality [26]. Of the 45,638 predicted protein-coding genes, 34,140 showed homology to proteins in the functional annotation databases and were successfully annotated. This is more than the number of protein-coding genes reported for most marine fish [25,70,78]. Overall, the assembly of the S. forsteni genome was of good quality, rendering it suitable for subsequent research and in-depth analysis.
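The summary statistics discussed above are straightforward to reproduce. The sketch below uses the usual definition of N50 (the length at which sequences of that size or longer cover at least half of the assembled bases) and recomputes the two percentages from the counts given in the text. The function names are illustrative and do not correspond to any tool used in the study.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    if not lengths:
        raise ValueError("no contig lengths given")
    total = sum(lengths)
    running = 0
    # Walk contigs from longest to shortest until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


def percent(part, whole):
    """Percentage of `part` in `whole`, rounded to two decimals."""
    return round(100.0 * part / whole, 2)


# Counts quoted in the text above.
busco_completeness = percent(3571, 3640)    # complete BUSCOs / core gene set
annotated_fraction = percent(34140, 45638)  # annotated / predicted genes
```

With the counts from the text, `busco_completeness` evaluates to 98.1 and `annotated_fraction` to 74.81, matching the reported 98.10% and 74.81%.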

Repetitive Element Analysis
Among repetitive elements, transposable elements (TEs) have received the most attention. TEs are DNA sequences capable of changing their positions within the genome, sometimes causing or reversing mutations and altering the genetic characteristics and genome size of a species [76,79]. Research has found that TE content varies, ranging from 5% in pufferfish to 56% in zebrafish, and that it is positively correlated with the genome size of fish [55]. Our results exhibited a similar trend. The content of repetitive sequences in the genome of S. forsteni was notably high, which may be closely related to its high content of TEs. Class I TEs and Class II TEs account for 9.74% and 23.04% of the genome, respectively. The predominance of Class II TEs in teleost genomes suggests a significant composition of DNA transposons, coupled with a relatively sparse presence of ancient TE copies. This pattern indicates that DNA transposons play a dominant role in shaping the genomic architecture of these species [53]. Some studies have shown that the content of transposons in different species is closely related to their historical evolution and genome size [51,53,80]. Throughout evolutionary history, the size of genomes and the occurrence of transposable elements have affected the structural and functional aspects of both genomes and cells. Variations in these aspects have impacted the morphological and functional features of organisms upon which natural selection exerts its influence directly [81]. In a comparative analysis of TE content across S. forsteni and seven other species, a positive correlation was observed, indicating that the relationship between TE content and genome size is approximately linear. As an earlier evolved species, zebrafish, with its larger genome size, encompassed a higher proportion of TEs. In comparison with other later-evolved fish species, the TE content in S. forsteni, C. undulatus, and A. ocellaris was found to be similarly high, exceeding 30% in each of these species (Figure 3, Table S2). Additionally, S. forsteni and C. undulatus, which are phylogenetically sister taxa, also exhibited relatively larger genome sizes compared to other parrotfish species. In summary, the high content of repetitive sequences, especially TEs, in the S. forsteni genome suggests their potential influence on genome size, but this relationship is not conclusive, and additional research is required.
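The roughly linear relationship between TE content and genome size described above can be quantified with a Pearson correlation coefficient and a least-squares line. The sketch below shows one way to do this in plain Python. It is not the authors' analysis, the function names are illustrative, and the per-species values must be supplied by the reader (for example from Table S2), since none are reproduced here.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def least_squares_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares fit y = a*x + b."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x
```

Here `xs` would hold genome sizes and `ys` the corresponding TE percentages. With only eight species, a high correlation coefficient should be read cautiously, which is consistent with the hedged conclusion drawn in the text.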

Population History of S. forsteni
Past population sizes can be inferred based on the representative genome sequence of a species. This approach enables the exploration of biological questions such as the impacts of climatic events on population size and structure, the effects of human activities on wildlife populations, and the consequences of domestication [82,83,84,85,86]. The Pairwise Sequentially Markovian Coalescent (PSMC) model is a mature analytical method for inferring population dynamics [56]. Our PSMC analysis of S. forsteni showed that the population increased until approximately 4 Ma and then began to decline. This decline may be linked to an event 2.15 Mya, when a large asteroid (more than 1 km in diameter) is thought to have fallen into the Southern Ocean in the Eltanin Fault zone, generating a super-tsunami that resulted in large-scale marine extinction [87]. In addition, affected by the glacial-interglacial cycles that characterize the mid-Pleistocene climate transition (MPT, 1.2-0.55 Ma), the population size of the whitespot parrotfish experienced a small expansion followed by a sustained contraction during this period [88,89]. Interglacial global warming and glacier retreat may have contributed to the expansion of the whitespot parrotfish population by creating conditions for the recovery and expansion of coral reefs [90]. The second decline in population size around 700 Ka may be related to the mid-Pleistocene climate transition that occurred approximately 0.9 Ma [89]. Extreme glacial environmental conditions led to a general decline in the abundance and diversity of species prone to extinction. Our findings are similar to those of analyses of the population history of catfishes conducted by other researchers [91]. In summary, we found that the expansions and contractions of the whitespot parrotfish population all occurred before the last interglacial period [92].
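PSMC reports time and population size in scaled units, and the dates discussed above depend on the generation time (g) and per-generation mutation rate (µ) used to rescale them. The sketch below shows the conventional conversion as generally applied when plotting PSMC output, assuming the default 100 bp bin size. The function and parameter names are illustrative, and no values of g or µ from the study are assumed.

```python
def psmc_to_real_units(t_scaled, lambda_scaled, theta0, mu, g, bin_size=100):
    """Convert scaled PSMC output to (years before present, effective size).

    t_scaled:      scaled time of an interval (units of 2*N0 generations)
    lambda_scaled: relative population size of that interval
    theta0:        scaled mutation rate per bin estimated by PSMC
    mu:            per-generation, per-site mutation rate
    g:             generation time in years
    bin_size:      number of bases per PSMC bin (assumed default of 100)
    """
    # Reference effective population size implied by theta0 = 4 * N0 * mu * bin_size.
    n0 = theta0 / (4.0 * mu * bin_size)
    years = 2.0 * n0 * t_scaled * g
    effective_size = n0 * lambda_scaled
    return years, effective_size
```

Because both axes scale with 1/µ and the time axis also scales with g, uncertainty in these two parameters shifts the inferred timing of expansions and declines, which is worth bearing in mind when matching them to specific climatic or geological events.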

Genome Evolution Analysis
Studies have shown that to best understand the evolution of the forms, functions, and behaviors of different organisms, it is essential to combine the phylogenetic information of a population with mechanistic studies of organismal function. This approach helps to create a comprehensive picture of the historical changes in key features [93]. Scaridae, as one of the coral reef fish families, exhibits exceptional diversity in body size, shape, coloration, feeding habits, reproductive behaviors, and life histories, all of which may be closely related to its evolutionary adaptations [24,94]. To explore the evolutionary relationship between S. forsteni and other vertebrates, a phylogenetic tree was reconstructed based on 1196 single-copy orthologs derived from a comparative genomic analysis of 15 species. According to our phylogenetic analysis, S. forsteni and C. undulatus, as well as N. celidotus and L. bergylta, shared a common ancestor, consistent with their all belonging to the Labridae family. They began to diverge from other teleosts about 65.4 Mya, a time associated with significant changes in global temperature due to the alternation of glacial and interglacial periods [95]. These climatic fluctuations, particularly during the warmer interglacial periods, led to sea level rises, which may have facilitated the expansion and recovery of coral reef habitats. This expansion of coral reefs likely provided new ecological niches, thereby potentially increasing the diversity of species adapted to these environments. Consequently, it is hypothesized that these environmental changes contributed to the diversification observed within the Labridae family [6,96]. We also found that S. forsteni and C. undulatus are closely related phylogenetically. At the genomic level, our phylogenetic analysis estimated that S. forsteni and C. undulatus diverged from their common ancestor approximately 45.2 Mya. This is also a plausible time frame for Labridae species to have begun undergoing adaptive evolution [8]. However, the karyotypic differences that emerged during the evolutionary process have led to distinctions between the two species, including in morphological and dietary traits [7,70]. Scarus species prefer to graze on coral, while Cheilinus species are more inclined to feed on sea grass, and Notolabrus species primarily target bivalves, crabs, and gastropods [7,97]. Additionally, the Labridae are traditionally classified in the suborder Labroidei together with the families Scaridae, Odacidae, Cichlidae, Pomacentridae, and Embiotocidae [98]. Thus, Scaridae and Pomacentridae are sister groups, and the dichotomous evolutionary network pattern along the ecological morphological axis that they exhibit is similar to the findings of our study [24,94]. The preservation of territorial behavior within their habitats may be attributed to their common evolutionary ancestry [99,100].

Enrichment of Expansion and Contraction Gene Families
Understanding the evolution of fish gene families is of significant importance for the conservation of the genetic resources of fish. Such studies have contributed to unveiling the origins and evolutionary processes of biodiversity, consequently providing valuable insights for the development of conservation strategies and the sustainable management of fishery resources [63,70]. In our analysis of the expansion and contraction of the gene families of the whitespot parrotfish, the number of expanded gene families exceeded the number of contracted gene families. Through our KEGG enrichment analysis, we found that the expanded genes were primarily enriched in pathways such as prion disease, Parkinson's disease, lipid and atherosclerosis, immune system, gap junction, mineral absorption, and phagosomes. Among them, a large number of enriched genes were found to belong to the disease defense system. As species adapt to new environments, they are constantly exposed to new pathogens and stimuli that cause the immune system to evolve [101,102]. High levels of PUFA were found in Scarus rivulatus, which may be attributed to its frequent consumption of unsaturated fat from coral algae, crustaceans, and polychaetes. This diet likely induces the endogenous de novo biosynthesis of fatty acids and elongated chains, contributing to lipid and energy metabolism through the lipid and atherosclerosis pathway [103,104]. At the same time, the mineral absorption pathway may promote the utilization and metabolism of minerals from the algae and debris that parrotfish scrape off the surface of coral reefs [11,13]. ABC transporters and protein processing in the endoplasmic reticulum, among the most basic biological processes in body function, involve the transport of many kinds of molecules. They may be very important to the energy metabolism and excretion processes of parrotfish [105]. In addition, we found that gene families involved in olfactory transduction had contracted during the evolution of parrotfish. This observation becomes particularly interesting when considering the behavior of coral reef fish eggs. After two days of incubation, coral reef fish eggs float away with the current, and the larvae may return to their birth reef by smell once they become capable of swimming [106]. These analyses demonstrated that the gene families of the parrotfish had undergone expansion and contraction during their evolutionary process, significantly influenced by environmental factors. This has enabled the physiological and behavioral traits of the parrotfish to adapt more effectively to ever-changing environmental conditions.

Analysis of Positive Selection Genes
Positive selection analysis is instrumental in identifying genes that have been subjected to natural selection pressures during the evolutionary process. These genes are often associated with improvements in a species' adaptability to specific environmental conditions, such as resistance to diseases, adaptation to particular ecological niches, or coping with environmental changes [107]. Our KEGG and GO enrichment analyses of positively selected genes in parrotfish showed that these protein-coding genes are mainly associated with cell transcription and translation, as well as sex hormone regulation. Additionally, there was notable representation in transcription machinery, translation factors, signaling pathways regulating the pluripotency of stem cells, the TGF-β signaling pathway, the estrogen signaling pathway, the regulation of post-embryonic development, and oocyte development. Sex reversal, including the transformation of primordial females into preferential females, is a common phenomenon observed in most parrotfish species [15]. In certain species, when a group of females lives together within the territory of a single large, brightly colored male and that male is removed from the group, the largest female in the colony can undergo sex reversal. This means that she changes both her coloration and behavior to take on the characteristics of a male [9]. The enrichment of the estrogen signaling pathway, oocyte development, and the regulation of post-embryonic development has been suggested to be closely associated with sex differentiation, sexual maturation, and sexual behavior in parrotfish. This phenomenon has sparked significant interest among researchers, making the mechanisms of sex determination in vertebrates a highly debated and extensively studied topic [52,108,109]. These results indicated that, under the influence of natural selection, genes in S. forsteni associated with specific traits have evolved adaptively.

Conclusions
This study marks the first sequencing and analysis of the S. forsteni genome, providing a pivotal reference genome for the Scaridae family. Our in-depth analysis of the 1.71 Gb genome revealed high genomic integrity and a wealth of functionally annotated genes. Additionally, through comparative genomic approaches, we have uncovered key genetic factors influencing traits in the Labridae family, as well as enrichment pathways related to immune diseases, organismal systems, cellular processes, and sex reversal. These findings not only offer new insights into the evolutionary history of S. forsteni and its relatives but also lay the groundwork for future conservation and management strategies for coral reef fishes. We anticipate that future research will further explore the role of these genomic features in coral reef ecosystems and how they impact the adaptability and survival strategies of coral reef fishes.

Figure 2. Venn diagram showing the common and unique genes of the Scarus forsteni genome in five functional annotation databases.

Figure 3. (A) Correlation between genome size and transposable element content in teleosts. (B) Total amount and relative proportions of DNA transposons, LTRs (Long Terminal Repeats), LINEs, and SINE (Short Interspersed Nuclear Element) retrotransposons in teleost genomes.

Figure 4. Fluctuation in population size of Scarus forsteni between 100 Ma and 10 Ka (g: generation time in years, µ: per-generation mutation rate).

Figure 5. The protein-coding genes of the total 15 species were clustered into 3234 gene families. Among these gene families, 1196 were single-copy gene families (one copy in each of these species).

Figure 6. Phylogeny construction based on the single-copy gene families. The black number represents the node differentiation time, the red number represents the expansion gene family number, and the blue number represents the contraction gene family number. The green layer represents outgroups, the orange layer represents other teleosts, and the yellow layer represents the Labridae family.

Figure 7. GO term and KEGG enrichment analyses of Scarus forsteni expansion gene families. (A) GO term. (B) KEGG.

Figure 9. GO term and KEGG enrichment analyses of Scarus forsteni positive selection genes. (A) GO term. (B) KEGG.

Table 2. BUSCO analysis statistics in the genome of Scarus forsteni.

Table 3. Statistics of the predicted non-coding RNAs in the genome of Scarus forsteni.

Table 4. Repetitive element annotations in the genome of Scarus forsteni.