Chromosomal-Scale Genome Assemblies of Two Coastal Plant Species, Scaevola taccada and S. hainanensis—Insight into Adaptation Outside of the Common Range

While most of the species in Goodeniaceae family, excluding the Scaevola genus, are endemic to Australasia, S. taccada and S. hainanensis have expanded their distribution range to the tropical coastlines of the Atlantic and Indian Oceans. S. taccada appears to be highly adapted to coastal sandy lands and cliffs, and it has become invasive in places. S. hainanensis is found mainly in salt marshes near mangrove forests, and is at risk of extinction. These two species provide a good system to investigate adaptive evolution outside the common distribution range of this taxonomic group. Here, we report their chromosomal-scale genome assemblies with the objective of probing their genomic mechanisms related to divergent adaptation after leaving Australasia. The scaffolds were assembled into eight chromosome-scale pseudomolecules, which covered 90.12% and 89.46% of the whole genome assembly for S. taccada and S. hainanensis, respectively. Interestingly, unlike many mangroves, neither species has undergone whole-genome duplication. We show that private genes, specifically copy-number expanded genes are essential for stress response, photosynthesis, and carbon fixation. The gene families that are expanded in S. hainanensis and contracted in S. taccada might have facilitated adaptation to high salinity in S. hainanensis. Moreover, the genes under positive selection in S. hainanensis have contributed to its response to stress and its tolerance of flooding and anoxic environments. In contrast, compared with S. hainanensis, the more drastic copy number expansion of FAR1 genes in S. taccada might have facilitated its adaptation to the stronger light radiation present in sandy coastal lands. In conclusion, our study of the chromosomal-scale genomes of S. taccada and S. hainanensis provides novel insights into their genomic evolution after leaving Australasia.


Introduction
The family Goodeniaceae comprises approximately 400 species classified into twelve genera [1]. Excepting the Scaevola genus [2], the other eleven genera are almost all endemic to Australasia [3]. Scaevola taccada (Gaertner) Roxburgh (Scaevola sericea Vahl is a synonym of this species) and Scaevola hainanensis Hance are two Scaevola species distributed outside Australasia. Scaevola species vary considerably in floral morphology, fruit morph, and habitat, and their plant types vary from herbs, scramblers, and shrubs to small trees [2,3].
S. taccada is widespread through the Pacific and Indian Oceans, inhabiting almost the whole global tropical and subtropical coastline [2,4]. S. taccada was introduced to the Americas for ornamental use, with invasion and colonization in tropical areas of Mexico being reported in recent years [5]. S. taccada produces dimorphic fruits, with one type

Genome Assembly and Annotation
We combined the 10× genomics paired-end sequencing reads, paired-end short reads, and Hi-C (chromosome conformation capture) sequence data to de novo assemble the genomes of S. taccada and S. hainanensis. After assembly, correction, and polishing, the sequences were assembled into eight pseudo-chromosomes corresponding to their karyotype of 2n = 16. The final genome assembly of S. taccada was 1.11 Gb in length, with the pseudo-chromosomes ranging from 92 Mb to 150 Mb. The final genome assembly of S. hainanensis was 1. 24 Gb in length, with the pseudo-chromosomes ranging from 106 Mb to 159 Mb (Table 1, Figure 1). The sizes of these genome assemblies are larger than the estimated genome sizes from Genome Scope, though smaller than the values from Genome Character Estimator (Table S1 in supporting information). The pseudo-chromosomes covered 90.12% and 89.46% of the whole genome assembly for S. taccada and S. hainanensis, respectively, indicating a chromosomal-level assembling continuity.
Character Estimator (Table S1 in supporting information). The pseudo-chromosomes covered 90.12% and 89.46% of the whole genome assembly for S. taccada and S. hainanensis, respectively, indicating a chromosomal-level assembling continuity. We examined the assembly completeness by aligning paired-end short reads to the assemblies [20]. In total. 97.37% and 99.29% of the short reads were successfully mapped to the genome assemblies of S. taccada and S. hainanensis, respectively. Using Benchmarking Universal Single-Copy Orthologues (BUSCO) [21], 94.5% and 93.8% of the embryophyta core genes (the embryophyta_odb10 database) were identified as complete (single copy or duplicated) ( Table 1). The high proportions of complete BUSCOs support the completeness of our assemblies.
Using a combination of homology-based search and de novo prediction, we estimated that 61.25% and 69.76% of the S. taccada and S. hainanensis genome assemblies respectively consist of repetitive sequences (see Table S2 in the supporting information). With repetitive elements masked, 34,560 and 34,033 gene models were predicted and 30,016 (86.85%) and 31,136 (91.49%) gene models were functionally annotated in the S. taccada and S. hainanensis genomes, respectively (Table S3). We examined the assembly completeness by aligning paired-end short reads to the assemblies [20]. In total. 97.37% and 99.29% of the short reads were successfully mapped to the genome assemblies of S. taccada and S. hainanensis, respectively. Using Benchmarking Universal Single-Copy Orthologues (BUSCO) [21], 94.5% and 93.8% of the embryophyta core genes (the embryophyta_odb10 database) were identified as complete (single copy or duplicated) ( Table 1). The high proportions of complete BUSCOs support the completeness of our assemblies.
Using a combination of homology-based search and de novo prediction, we estimated that 61.25% and 69.76% of the S. taccada and S. hainanensis genome assemblies respectively consist of repetitive sequences (see Table S2 in the supporting information). With repetitive elements masked, 34,560 and 34,033 gene models were predicted and 30,016 (86.85%) and 31,136 (91.49%) gene models were functionally annotated in the S. taccada and S. hainanensis genomes, respectively (Table S3).

Whole Genome Duplication and LTR Retrotransposon Insertion
Whole-genome duplication (WGD) events and transposon insertions are important drivers of plant genome size expansion. We first sought evolutionary relics of WGD in S. taccada and S. hainanensis genomes by detecting paralogous synteny gene blocks based on all-to-all blastp alignments [22]. However, we only identified 1382 (4.00%) syntenic genes from 85 collinear blocks in the S. taccada genome and 1302 syntenic genes (3.83%) from 73 collinear blocks in the S. hainanensis genome ( Figure 1). The peak of synonymous nucleotide substitutions (Ks) of syntenic gene pairs corresponds to the paleohexaploidisation (γ-WGD) in the ancestors of all extant eudicots ( Figure 2b). These results agree with the previous view that the Goodeniaceae lineage has not experienced a WGD event after the γ-WGD event [23].
Long-terminal-repeat (LTR) retrotransposons were the predominant component among all repetitive elements, taking percentages of 57.13% and 42.44%, respectively (Table S2). Notably, the relatively smaller numbers of LTR retrotransposons and the larger number of unclassified transposons in the S. hainanensis genome may be due its genome assembly being more fragmented. We calculated the insertion times of LTR retrotransposon elements by comparing the 5 end and 3 end LTR sequences. The insertion of LTR retrotransposons started as early as six million years ago and increased significantly to the peak 1-2 million years ago (Figure 2c). This rapid insertion resulted in a large number of LTR retrotransposons reserved in the S. taccada and S. hainanensis genomes. Hence, we infer that LTR-RT insertion has likely contributed to the relatively large genome size of the two species and to adaptive evolution in their invasion of new distribution ranges outside of Australasia.

Phylogeny Construction and Gene Copy Number Evolution
We first clustered the genes of S. taccada, S. hainanensis, and nine other angiosperm species (Helianthus annuus, Lactuca sativa, Daucus carota, Arctium lappa, Erigeron canadensis, Coffea canephora, Arabidopsis thaliana, Eutrema salsugineum, and Nelumbo nucifera) into homologous gene groups, with 31,225 groups identified. Among these gene groups, 8266 were shared by all species and 658 were single-copy orthologous groups. In order to date the divergence between S. taccada and S. hainanensis, the 658 single-copy genes were used for phylogenetic tree construction. With the constructed phylogeny, we estimated that S. taccada and S. hainanensis diverged ~11 million years ago (Figure 2d).
We found 572 gene families containing 5771 genes and 703 gene families containing 4192 genes that were private to S. taccada and S. hainanensis, respectively ( Figure 2e). The functional enrichment analysis showed that both the S. taccada and S. hainanensis genes were significantly overrepresented by gene ontology (GO) terms related to stress response, such as oxidoreductase activity (GO:0016702), heat shock protein (K03283), and serine/threonine-protein kinase (K20718 in S. taccada and K17545 in S. hainanensis). The

Phylogeny Construction and Gene Copy Number Evolution
We first clustered the genes of S. taccada, S. hainanensis, and nine other angiosperm species (Helianthus annuus, Lactuca sativa, Daucus carota, Arctium lappa, Erigeron canadensis, Coffea canephora, Arabidopsis thaliana, Eutrema salsugineum, and Nelumbo nucifera) into homologous gene groups, with 31,225 groups identified. Among these gene groups, 8266 were shared by all species and 658 were single-copy orthologous groups. In order to date the divergence between S. taccada and S. hainanensis, the 658 single-copy genes were used for phylogenetic tree construction. With the constructed phylogeny, we estimated that S. taccada and S. hainanensis diverged~11 million years ago ( Figure 2d).
We found 572 gene families containing 5771 genes and 703 gene families containing 4192 genes that were private to S. taccada and S. hainanensis, respectively ( Figure 2e). The functional enrichment analysis showed that both the S. taccada and S. hainanensis genes were significantly overrepresented by gene ontology (GO) terms related to stress response, such as oxidoreductase activity (GO:0016702), heat shock protein (K03283), and serine/threonine-protein kinase (K20718 in S. taccada and K17545 in S. hainanensis). The genes private to S. taccada were significantly enriched in the process of photosynthesis (GO:0009767, GO:0009521) and transmembrane protein (K20724, K17086). These genes might have contributed to the growth and photosynthetic response (Tables S4 and S5). The genes private to S. hainanensis were significantly enriched in the carbohydrate metabolic process (GO:0005975), which might have contributed to nutrient utilization.
We found 62 gene families that are significantly expanded in S. hainanensis and significantly contracted in S. taccada. Genes in these families were overrepresented by GO terms related to proton and electron transport and to superoxide metabolism (Table S8, Figure 3a,b). The expansion of genes involved in proton and electron transport in S. hainanensis may contribute to its tolerance of high salinity in mangrove habitats. In contrast, 92 gene families are significantly expanded in S. taccada and significantly contracted in S. hainanensis. Genes in these families were significantly overrepresented by GO terms related to photosynthesis and carbon fixation (Table S8, Figure 3a,b). The copy number increase of genes involved in photosynthesis and carbon fixation is consistent with S. taccada growing to form tall trees in coastal sand and rock environments.

Genes under Positive Selection
We used the "branch-site model" in PAML v4.8 [17] to detect the genes under positive selection in S. taccada and S. hainanensis from a set of 4317 high-confidence orthologs. We found 515 (Table S9) and 542 (Table S10) genes under positive selection in the S. taccada and S. hainanensis lineage, respectively. The GO terms carbohydrate derivative biosynthetic process, cellular response to DNA damage stimulus, and DNA recombination are overrepresented in both lists of genes under positive selection (Tables S11 and S12, Figure 3c,d). However, GO terms related to metabolic processes, including carotenoid, alcohol, and superoxide metabolism, are only overrepresented in the genes under positive selection in S. hainanensis (Table S12, Figure 3c,d). These metabolism-related genes are likely important for S. hainanensis to respond to stress and to tolerate flooding and anoxic environments.
Particularly, certain genes are under positive selection on the common ancestor branch of the two species (Table S13). Nine genes that are under positive selection have functions related to light sensitivity and regulation (Table S14). These nine genes might contribute to the response to intense light stress, photosynthesis, and carbon fixation. For examples, XCT and APRR5 are circadian-associated. XCT encodes the XAP5 family protein, which is involved in light regulation of the circadian clock and photomorphogenesis. APRR5 encodes a pseudo-response regulator with mutations that affect various circadian-associated biological events.
Another 18 genes are involved in response to multiple stresses (Table S14). For example, three genes likely have functions in salt tolerance; VOZ1 positively regulates several salt-responsive genes, SIA1 is an atypical protein kinase induced by salt stress, and DELTA-OAT encodes an ornithine delta-aminotransferase, which responds to salt stress.

Genes under Positive Selection
We used the "branch-site model" in PAML v4.8 [17] to detect the genes under positive selection in S. taccada and S. hainanensis from a set of 4317 high-confidence orthologs. We found 515 (Table S9) and 542 (Table S10) genes under positive selection in the S. taccada and S. hainanensis lineage, respectively. The GO terms carbohydrate derivative biosynthetic process, cellular response to DNA damage stimulus, and DNA recombination are overrepresented in both lists of genes under positive selection (Tables S11 and S12, Figure  3c,d). However, GO terms related to metabolic processes, including carotenoid, alcohol, and superoxide metabolism, are only overrepresented in the genes under positive

Evolution of the FAR1 Gene Family in S. taccada and S. hainanensis
In both S. taccada and S. hainanensis, we found only four genes that are homologous to the PHYA-PHYE genes. That is to say, there is no copy number differentiation in phytochrome genes between the two species. In comparison, we identified 67 and 43 genes belonging to the FAR1 family in the genomes of S. taccada and S. hainanensis, respectively. These genes were named StaFAR1-StaFAR67 and ShaFAR1-ShaFAR43 following their chromosomal positions (Tables S15 and S16). The conserved domain motifs of FAR1 members were identified (Figure 4).   We aligned the FAR1 protein sequences of S. taccada, S. hainanensis, and A. thaliana to construct a phylogenetic tree using the maximum likelihood method. According to their phylogenetic relationship and motif pattern, these FAR1 genes were divided into five groups: Group I comprises 19 FAR1 members from S. taccada; Group II contains 28 and 24 FAR1 members from S. taccada and S. hainanensis, respectively; Group III contains 3, 3, and 2; Group IV contains 13, 11, and 11; and Group V contains 4, 5, and 4 members from S. taccada, S. hainanensis, and A. thaliana, respectively (Figure 4). We found 66 and 41 of these FAR1 genes located on the chromosome-anchored scaffolds in S. taccada and S. hainanensis, respectively. Moreover, these genes are evenly distributed on the chromosomes ( Figure 5).
We aligned the FAR1 protein sequences of S. taccada, S. hainanensis, and A. thaliana to construct a phylogenetic tree using the maximum likelihood method. According to their phylogenetic relationship and motif pattern, these FAR1 genes were divided into five groups: Group I comprises 19 FAR1 members from S. taccada; Group II contains 28 and 24 FAR1 members from S. taccada and S. hainanensis, respectively; Group III contains 3, 3, and 2; Group IV contains 13, 11, and 11; and Group V contains 4, 5, and 4 members from S. taccada, S. hainanensis, and A. thaliana, respectively (Figure 4). We found 66 and 41 of these FAR1 genes located on the chromosome-anchored scaffolds in S. taccada and S. hainanensis, respectively. Moreover, these genes are evenly distributed on the chromosomes ( Figure  5).
We inferred how the FAR1 gene family has increased their copy number, revealing two pairs of tandem duplication and 17 pairs of transposed duplication in S. taccada (Table  S17). In comparison, one pair was tandem duplicated, two pairs were proximally duplicated, and six pairs were transposed in S. hainanensis (Table S18). In short, transposed and tandem duplication have played a key role in FAR1 gene family expansion.

Discussion
The family Goodeniaceae has recently experienced evolutionary radiation, leading to the current species diversity of about 400 species. The Scaevola genus consists of ~130 species worldwide, with the center of origin and distribution being Australasia. There are about 40 Scaevola species that are distributed outside Australasia. Both S. taccada and S. plumieri are widespread across tropical regions, while other species are scattered in tropical islands. S. hainanensis is distributed in the coastal areas from southern China to northern Vietnam. S. taccada is the only Saevola species with a distribution overlapping with S. hainanensis. We inferred how the FAR1 gene family has increased their copy number, revealing two pairs of tandem duplication and 17 pairs of transposed duplication in S. taccada (Table S17). In comparison, one pair was tandem duplicated, two pairs were proximally duplicated, and six pairs were transposed in S. hainanensis (Table S18). In short, transposed and tandem duplication have played a key role in FAR1 gene family expansion.

Discussion
The family Goodeniaceae has recently experienced evolutionary radiation, leading to the current species diversity of about 400 species. The Scaevola genus consists of~130 species worldwide, with the center of origin and distribution being Australasia. There are about 40 Scaevola species that are distributed outside Australasia. Both S. taccada and S. plumieri are widespread across tropical regions, while other species are scattered in tropical islands. S. hainanensis is distributed in the coastal areas from southern China to northern Vietnam. S. taccada is the only Saevola species with a distribution overlapping with S. hainanensis.
Polyploidy and transposon insertions are two important drivers of genome evolution, providing genomic materials for evolutionary innovation [24][25][26]. Whole-genome duplication events have been demonstrated to contribute to species divergence and morphological diversification in Asterids [23]. A combination of Ks-based and synteny analyses showed that S. taccada and S. hainanensis have experienced no recent WGD events after the well-known paleohexaploidisation (γ-WGD) event. However, their genomes consist of large proportions of LTR-RTs which were inserted into the genomes within the past several million years. These recent and rapid LTR-RT insertions might have been the main driver of genomic evolution which helped these species to move out of Australasia.
Plants growing in coastal regions are often subject to UV, high salt stress, and periodic tides. The different rates of gene gain and loss usually cause the copy number of homologous gene groups to vary among species, which provides the genetic basis for phenotypic innovation, species diversification, and genome size evolution [27,28]. Both the genes private to and the gene groups expanded in S. taccada and S. hainanensis are enriched in the GO and KEGG terms related to photosynthetic, heat shock protein, and serine/threonine-protein kinase. Hence, gene copy evolution might have contributed to abiotic stress signaling and responses in these two species.
We found that genes under positive selection have contributed to the maintenance of photosynthesis, response to stress, and regulation of the circadian clock, which is important for S. taccada and S. hainanensis to adjust to the light and tidal rhythm. We concluded that these private genes, copy-number expanded genes, and positively-selected genes are essential for the two species to adapt to their extant habitats.
We identified both S. taccada and S. hainanensis as having more FAR1 gene copies than closely related species (7 in Daucus carota, 1 in Helianthus annuus, 10 in Lactuca sativa, and 24 in Solanum pennellii; data from PlantTFDB [29]). Tandem duplication and transposed duplications have made the key contribution to the copy number expansion of FAR1 genes in these two species. In particular, S. taccada has more FAR1 genes than S. hainanensis. The larger number of FAR1 genes in S. taccada may be beneficial because the S. taccada trees are more likely to be injured by strong light in their coastal sandy land habitats as compared with the mangrove understory habitat of S. hainanensis.
In conclusion, we have reported the first chromosome-level genome assemblies of S. taccada and S. hainanensis. By comparative genomic analysis, we have revealed the mechanisms underlying their adaptation outside of their common range in Australasia. Our genomic analyses provide insights into their adaptation to different environments in intertidal zones and coastal sands. To the best of our knowledge, the two genome assemblies reported here are the first available public genome resource on the Goodeniaceae family. Our genome research provides a reference for studies of the invasion ecology of S. taccada and conservation ecology of S. hainanensis. Moreover, our genome resource can benefit subsequent studies on genetic diversity, demographic history, radiogenic evolution, hybridization, and biogeographic processes in these species.

Plant Material
The Scaevola taccada and S. hainanensis materials used for genome sequencing were collected from Dongzhai Harbor (110 • 38 N), Haikou, Hainan province, China, respectively. Total genomic DNA was extracted from leaves following the CTAB method [30]. The genomic DNA sample was quantified using a Nanodrop 2000 Spectrophotometer (NanoDrop Technologies, Wilmington, DE, USA) and assessed by 1% agarose gel electrophoresis. To assist genome annotation, RNA was extracted from the flowers and leaves of S. hainanensis and the flowers, fruit, and leaves of S. taccada following the modified CTAB method [31]. These genomes were sequenced as part of a project aiming to sequence mangrove species worldwide [32].

10× Genomics Sequencing
Only reads over 50 Kb long were retained for 10× genomics library construction. The prepared 10× genomics library was then sequenced as 150 bp paired-end reads on a BGISEQ-500 platform. Data quality control was performed using SOAPnuke v 2.7.1 [33] software, yielding c. 109.06 Gb of bases for S. taccada and c. 81.48 Gb of bases for S. hainanensis, respectively.

Hi-C Sequencing
Tender leaves were collected for Hi-C libraries construction. The leaves were fixed with formaldehyde and lysed and the cross-linked DNA was digested with MboI. Then, 100-bp paired-end Hi-C libraries were constructed and sequenced on a BGISEQ-500 platform. After removing low-quality reads with SOAPnuke v2.7.1 software, we yielded c. 125.54 Gb and 132.63 Gb of bases for S. taccada and S. hainanensis, respectively.

Short-Read Sequencing
DNA short-read sequencing was performed for gap-filling and error correction. Purified DNA was used for 150-bp paired-end libraries construction and then sequenced on a BGISEQ-500 platform. After quality control by SOAPnuke v2.7.1, we yielded c. 84.52 Gb and 103.02 Gb of bases for S. taccada and S. hainanensis, respectively.
The RNA-seq libraries were sequenced on a BGISEQ-500 platform for gene prediction. For S. taccada, we yielded c. 5.59 Gb, 5.86 Gb, and 6.64 Gb of bases from sequencing of flowers, fruits, and leaves, respectively. For S. hainanensis, we yielded c. 6.96 Gb sequences from sequencing of flowers and c. 11.67 Gb from leaves.

De Novo Genome Assembling
Before assembling, we used the short reads to estimate genome size using 17-mer of jellyfish v2.2.10 [34] and genomic character estimator (GCE) v1.0.2 [35]. The qualityfiltered reads were then used for genome assembling. We first assembled the S. taccada and S. hainanensis genomes based on the 10× reads using Supernova v2.0.1 [36]. The draft assembly was corrected and gap-filled based on paired-end short reads using GapCloser v1.2.1 [37]. Hi-C reads were qualified to improve the assemblies using HiC-Pro v3.2 [38] and then mapped to the polished 10× genomes to obtain valid reads with contact information. The Hi-C maps were generated by Juicer v3 [39]. The scaffolds were visualized and manually adjusted by Juicebox v1.8.0 [40] and anchored to the chromosome level using the 3D de novo genome assembly (3D-DNA) pipeline [41].

Collinearity Analysis and Whole-Genome Duplication (WGD) Analysis
The homologous genes within each of S. taccada and S. hainanensis, as well as between each of S. taccada, S. hainanensis and H. annuus, were identified by all-vs-all BLASTP with a cutoff e-value of 10 −5 , identity ≥ 40%. Collinear blocks were detected by MCScanX [54] with default parameters. The genomic distribution of collinear blocks was visualized using Circos v0.69-9 [55]. Protein sequences in the collinear blocks were extracted and aligned using MAFFT v7.480, and codon sequences were obtained using PAL2NAL v14 [56]. Synonymous substitution rates (Ks) were calculated by KAKS_calculator v2.0 with the YN substitution model [57].

Phylogeny Construction and Divergence Time Estimation
To estimate divergence times, we used OrthoFinder v2.5.4 to identify orthologous gene groups for S. taccada, S. hainanensis, and nine other angiosperms species (Helianthus annuus, Lactuca sativa, Daucus carota, Arctium lappa, Erigeron canadensis, Coffea canephora, Arabidopsis thaliana, Eutrema salsugineum, and Nelumbo nucifera). All proteins from these eleven species were merged to perform an all-to-all alignment using BLASTP with a cutoff e-value of 1 × 10 −10 . Single-copy orthologs of the eleven species were aligned using MAFFT v7.480, and codon sequences were obtained using PAL2NAL v14. The alignments were trimmed using GBLOCKS v0.91b [58] using default parameters. We used JMOD-ELTEST2 [59] to select an appropriate nucleotide-substitution model for reconstructing the phylogeny. To reconstruct the phylogenetic tree, we used PhyML v3.3 [60] with the best-fit model (GTR + I + G) and 1000 bootstrap replicates. Finally, the program MCM-CTREE from the PAML4.8 package was employed to estimate divergence times, using "seq like (usedata = 1)", "JC69 (model = 0, alpha = 0)", and "independent rates (clock = 2)". The root node of eudicots and monocots was constrained between 125-247 Mya and the common ancestor of eudicots was placed at 119.6-128.63 Mya based on two reliable fossil calibrations [61]. The MCMC analyses were run for 500,000 generations and sampled every 10 generations, using a burn-in of 200,000 iterations.

Gene Group Evolution and Positive Selection Detection
Gene copy number evolution was examined using CAFE v4.2.1 [62]. We obtained the counts of genes in each group from OrthoFinder v2.5.4. Gene families with more than 100 gene copies in a single species were removed from this analysis. Expansion or contraction of a gene group was identified as significant with a p-value < 0.01.
To detect genes under positive selection, orthologs were identified from the 11 species mentioned above. Orthologs were retained according to the following criteria: (1) orthologous groups with a single copy in each species or (2) low-copy orthologs that had only a single copy in the outgroup (A. thaliana) and less than five copies in all other species. The latter type of ortholog was further identified using the following method: first, we aligned the single-copy gene of the outgroup against all copies from other species with BLASTP; then, the best hit was used as the representative ortholog. The orthologs were aligned using MAFFT v7.480 and the aligned peptide sequences were transformed to nucleotide acid codon sequences using PAL2NAL v14. The codon alignments were used as input for CODEML in the PAML 4.8 package to detect genes under positive selection, with the branch of S. taccada, S. hainanensis, or their common ancestor as the foreground branch. The likelihood ratio test (LRT) p-values were computed using the χ 2 statistic with a freedom degree of one. To remove false discoveries, the p-values were transformed to Q-values using Benjamini-Hochberg method [63] for multiple test correction (FDR < 0.05). Functional annotations of the positively-selected genes were obtained from the TAIR database. GO and KEGG enrichment analyses of expanded and species-private gene groups and genes under positive selection were performed using the ClusterProfiler v.4.0.2 package [64].

Analyses of the FAR1 Gene Family
The structural domains of FAR1 (PF03101) were obtained from the Pfam database (http: //www. ebi.ac.uk/Tools/hm, accessed on 4 March 2023). HMMER v3.3.2 software [65] was used to search FAR1 domains based on the hidden Markov model (HMM) file (PF03101) with standard parameters (e-values < 10 −5 ). Then, the obtained FAR1 candidates were rechecked using the NCBI CDD (Conserved Domains) database (https://www.ncbi.nlm. nih.gov/cdd, accessed on 4 March 2023) and SMART (http://smart.embl-heidelberg.de/, accessed on 4 March 2023) to remove false positive findings. The sequences of FAR1 genes in A. thaliana were downloaded from the PlantTFDB dataset. The conserved domain motifs of FAR1 members were identified by MEME v4.12.0 [66], with the number of motifs set to 10, the minimum length of motifs set to 6, and the maximum length set to 100.
The FAR1 protein sequences of S. taccada, S. hainanensis, and A. thaliana were aligned using MAFFT v7.480. The phylogenetic tree was constructed using the maximum likelihood (ML) algorithm by IQ-TREE v2.2.0.3 [67] with 1000 bootstrap values. To determine the physical position of FAR1 genes on chromosomes, all the FAR1 genes were mapped to the corresponding reference genome using TBtools v1.068 [68]. Gene duplication events were analyzed by DupGen_finder [69] with default parameters, using H. annuus as the outgroup.