Improved Cladocopium goreaui Genome Assembly Reveals Features of a Facultative Coral Symbiont and the Complex Evolutionary History of Dinoflagellate Genes

Dinoflagellates of the family Symbiodiniaceae are crucial photosymbionts in corals and other marine organisms. Of these, Cladocopium goreaui is one of the most dominant symbiont species in the Indo-Pacific. Here, we present an improved genome assembly of C. goreaui combining new long-read sequence data with previously generated short-read data. Incorporating new full-length transcripts to guide gene prediction, the C. goreaui genome (1.2 Gb) exhibits a high extent of completeness (82.4% based on BUSCO protein recovery) and better resolution of repetitive sequence regions; 45,322 gene models were predicted, and 327 putative, topologically associated domains of the chromosomes were identified. Comparison with other Symbiodiniaceae genomes revealed a prevalence of repeats and duplicated genes in C. goreaui, and lineage-specific genes indicating functional innovation. Incorporating 2,841,408 protein sequences from 96 taxonomically diverse eukaryotes and representative prokaryotes in a phylogenomic approach, we assessed the evolutionary history of C. goreaui genes. Of the 5246 phylogenetic trees inferred from homologous protein sets containing two or more phyla, 35–36% have putatively originated via horizontal gene transfer (HGT), predominantly (19–23%) via an ancestral Archaeplastida lineage implicated in the endosymbiotic origin of plastids: 10–11% are of green algal origin, including genes encoding photosynthetic functions. Our results demonstrate the utility of long-read sequence data in resolving structural features of a dinoflagellate genome, and highlight how genetic transfer has shaped genome evolution of a facultative symbiont, and more broadly of dinoflagellates.


Introduction
Dinoflagellates of the family Symbiodiniaceae [1] are diverse microalgae, with many forming symbiotic relationships that are critical to corals and other coral reef organisms. Symbiodiniaceae provide carbon fixed via photosynthesis and other essential nutrients to coral hosts [2,3]. Environmental stress leads to breakdown of this partnership and loss of the algae, i.e., coral bleaching, putting the corals at risk of starvation, disease, and potential death [4]. Recent studies of Symbiodiniaceae genomes have revealed extensive sequence and structural divergence [5][6][7][8], and potentially a greater, yet-to-be recognised phylogenetic diversity among these taxa [9,10]. A recent comparative analysis of genomes from 18 dinoflagellate taxa (of which 16 are Symbiodiniaceae) revealed distinct phylogenetic signals between genic and non-genic regions [11], indicating differential evolutionary pressures acting on these genomes. These findings illustrate how the evolutionary complexity of Symbiodiniaceae genomes may explain their diverse symbioses and ecological niches [12].
Cladocopium (formerly Clade C) is the most taxonomically diverse genus of family Symbiodiniaceae and found predominantly in the Indo-Pacific, with Cladocopium goreaui (formally type C1) being a dominant species in the region [13,14]. The earlier genome analysis of C. goreaui SCF055 [7] revealed the genetic capacity of the species to establish and maintain symbiosis with coral hosts, respond to stress, and to undergo meiosis; i.e., many of the implicated genes show evidence of positive selection. Although these results provide insights into the adaptive evolution of genes, the assembled genome, generated using only Illumina short-read data, remains fragmented with 41,289 scaffolds [7]. Additional analysis of the draft genome also indicated that some scaffolds may be of bacterial origin due to their anomalous G+C content [15]. For these reasons, the existing data limit our capacity to reliably assess repetitive genomic elements and evolutionary origins of the predicted genes.
Here, we present an improved, hybrid genome assembly for C. goreaui, combining novel PacBio long-read sequence data with the existing short-read sequence data from Liu et al. [7], and incorporating a new full-length transcriptome to guide gene prediction. Incorporating proteins predicted from the genome with those from 96 taxonomically broadly sampled eukaryote taxa and representative prokaryotes in a phylogenomic analysis, we assessed the evolutionary origins of genes in C. goreaui and other dinoflagellates, and the impact of horizontal genetic transfer (HGT) in shaping the evolution of this lineage. The earlier investigation based on transcriptome data from the bloom-forming, toxin-producing dinoflagellate Alexandrium tamarense [16] revealed evidence of HGT, implicating both prokaryote and eukaryote donors, in 14-17% of investigated protein trees. Few genes and no genome data were available from other dinoflagellates when that study was conducted. However, the genomic and genetic resources of dinoflagellates have grown appreciably in the last decade, enabling a more-balanced representation of taxonomic diversity to support such an analysis. In this study, we examine how the nuclear genome of C. goreaui, and broadly that of dinoflagellates, has evolved and benefited from the acquisition of genomic (and functional) novelties through HGT.

Improved C. goreaui Genome Assembly Reveals More Repeats and More Duplicated Genes
We generated PacBio long-read genome data (50.2 Gbp; Table S1) for C. goreaui SCF055 and combined them with existing Illumina short reads in a hybrid approach to generate a de novo genome assembly (see Methods). The first published genome assembly of the SCF055 isolate [7] was previously refined to exclude putative contaminant sequences [15]. Compared to the draft assembly reported in Chen et al. [15], our assembly exhibits a five-fold decrease in the number of scaffolds (6843) and a three-fold increase in scaffold N50 length (354 Kbp; Table 1). Genome size was estimated at 1.3 Gbp based on k-mers ( Figure S1), and our improved assembly (1.2 Gbp in size) is larger than the earlier draft (1.0 Gbp; Table 1).
We also generated 65,432 near-full-length transcripts using PacBio Iso-Seq to guide prediction of protein-coding genes. Using the same approach tailored for dinoflagellates [15], we predicted 45,322 protein-coding genes (mean length 15,745 bp) in the genome, compared to 39,066 (mean length 8428 bp) reported in Chen et al. [15]. The majority (82.4%) of predicted genes are supported by transcript evidence, and genome completeness is markedly improved, evidenced by the 15.1% greater recovery of core conserved genes (BUSCO alveolate_odb10) [17] at 82.4% (Table 1). Most predicted proteins (40,495; 89.3%) have UniProt hits based on sequence similarity (BLASTp; E ≤ 10 −5 ); 19,904 (43.9%) have hits in the curated Swiss-Prot database, 8836 (19.5%) covering > 90% of full-length Swiss-Prot proteins. The remaining 4827 (10.7%) C. goreaui proteins have no significant hits in UniProt, indicating the prevalence of "dark" genes that encode functions yet to be discovered. We identified and compared repeat content in the C. goreaui genome with that in the earlier assembly of Chen et al. [15]. Excluding simple repeats, we found a higher repeat content (36.5% of total bases in the assembled genome; Figure 1a) in the current assembly than in the initial data (21.1%), with known repeat types accounting for 17.3% of total bases, compared to 5.6%. This result indicates a better resolution of repetitive regions in the revised genome with the incorporation of long reads. Novel, Symbiodiniaceaespecific repeats would remain as unclassified in this instance due to the lack of identified dinoflagellate repeats in the current database. Among known repeat types, long terminal repeats (LTR) and DNA transposons are the most prevalent repeat families (constituting 7.3% and 6.2% of total bases, respectively). These two repeat families exhibit distinct levels of sequence divergence; those with Kimura substitution values centred between 0 and 10 are more conserved than those with values centred between 15 and 25. Most LTRs are highly conserved in C. goreaui, a trend also observed in the genomes of other dinoflagellates [10,18].
Collinear gene blocks within a genome represent duplicated gene blocks, e.g., via segmental or whole-genome duplication. Based on the recovery of these gene blocks, we identified a greater proportion of duplicated genes than Chen et al. [15]: 35,119 (77.5% of 45,322) genes in duplicates, compared to 25,550 (65.5% of 39,006; Table 2). We found 31,827 (70.2%) genes in dispersed duplicates, suggesting a lack of conserved collinearity of genes in the C. goreaui genome. The lack of collinearity of duplicated genes lends support to the hypothesized extensive structural rearrangements in genomes of facultative symbionts in the family Symbiodiniaceae [12]; more genome data from non-facultative or free-living taxa will allow for a more-robust testing of this hypothesis. With the improved structural annotation, we recovered a 2.39-fold greater number of tandemly repeated genes, and 387 genes (in 34 collinear blocks) implicated in segmental duplications (Table 2). Tandemly repeated genes in dinoflagellates are thought to be a mechanism for improving transcriptional efficiency, particularly for genes encoding critical functions [18]. Comparing gene ontology (GO) terms annotated in the 1998 tandemly repeated genes against those in all C. goreaui genes, the top three enriched terms for Cellular Component (Table S2) are "chloroplast thylakoid membrane" (GO:0009535; p = 1.5 × 10 −7 ), "photosystem I reaction center" (GO:0009538; p = 5.4 × 10 −7 ) and "photosystem II" (GO:0009523; p = 0.00042). This result indicates that genes encoding photosynthetic functions tend to appear in tandem repeats in C. goreaui. We applied the same approach to genes in segmental duplications and found that the most significantly enriched Biological Process term for this set is "recombinational repair" (GO:0000725, p = 7.2 × 10 −5 , Table S3). The repair of errors during genetic recombination is essential for maintaining genome integrity; conservation of collinear organization of these genes may be key in ensuring their transcription efficiency.  [15]. (d) Frequency of strand-orientation change in 10-gene windows and (e) cumulative percentage of genes in unidirectionally encoded blocks, shown for five representative Symbiodiniaceae genomes: Breviolum minutum [15], C. goreaui [15], C. goreaui (boldfaced, this study), Fugacium kawagutii [19], and Symbiodinium microadriaticum [20]; and (f) number of inter-block regions in each genome assembly indicating putative TAD central regions and boundaries, shown for representative genomes, based on the minimum number of genes in a unidirectional block. Bars above the x-axis represent inter-block regions at which orientations of unidirectional blocks converged, whereas bars below the x-axis represent those at which the orientations diverged.

Topologically Associated Domains (TADs) and Unidirectional Gene Blocks
Recent studies have clarified interacting genomic regions via topologically associated domains (TADs) of dinoflagellate chromosomes [20,21]. Orientations of unidirectionally encoded gene blocks diverge from a TAD central region, converging at TAD boundaries [20]. Regulatory elements such as promoters and enhancers of gene expression are hypothesised to be concentrated in these regions to regulate transcription of the upstream or downstream unidirectional gene blocks [22]. To assess putative TAD regions in the revised C. goreaui genome, we first identified the unidirectionally encoded gene blocks. We followed the method of Stephens et al. [18] to enumerate gene-orientation change(s) in a ten-gene window, moving across the entire genome assembly; the tendency for no change in gene orientation is an indication for the prevalence of unidirectional encoding. We performed this analysis in a set of representative Symbiodiniaceae genomes ( Figure 1d). Interestingly, we observed a lower percentage (34.7%) of ten-gene windows with conserved orientation in the revised C. goreaui genome, when compared to 44.6% in the assembly of Chen et al. [15]. The equivalent percentages in the more-contiguous, near-chromosomal level genome assemblies of S. microadriaticum [20] (13.9%) and F. kawagutii [19] (34.5%) are also lower, compared to 46.1% in the more-fragmented assembly of B. minutum [15] (Figure 1d). However, when we assessed the sizes of unidirectional gene blocks in these genomes, they are clearly larger in the more-contiguous assemblies (Figure 1e). For instance, 32.6% of genes in C. goreaui are found in block sizes of 10 or more genes, compared to only 7.6% in the earlier assembly ( Figure 1e). These observations indicate that the lower recovery of ten-gene windows with conserved orientation is caused by the increased recovery of windows spanning two gene blocks with opposing orientations, as expected in TAD central or boundary regions, in the more-contiguous assemblies. In this way, the more-contiguous assembly enables better recovery of putative TAD regions.
To assess putative TAD regions, we examined genomic regions between any two unidirectional gene blocks. We identified these regions requiring the gene blocks on either side to contain at least N number of genes, where N is 4, 6, 8, or 10. Figure 1f shows the recovery of these regions across threshold N in the representative genomes, with those with converging gene-block orientations (i.e., putative TAD boundaries) above the x-axis, and those with diverging orientations (i.e., putative TAD central regions) below the xaxis. We recovered approximately 6-to 30-fold larger numbers of these regions in the more-contiguous C. goreaui assembly (e.g., 327 putative TAD boundaries) and in nearchromosomal level assemblies of S. microadriaticum (340) and F. kawagutii (454), than in the more-fragmented assemblies of B. minutum (59) and C. goreaui (15). The implicated unidirectional gene blocks on either side of these regions are also larger, e.g., at N = 10, we identified 25 putative TAD boundaries in C. goreaui, compared to only two in the earlier assembly; the assembly of F. kawagutii shows the greatest recovery of TAD-associated regions, with 129 putative TAD boundaries implicating blocks of 10 or more genes on either side. TAD boundaries have been reported to exhibit a dip in GC content in the middle of the sequence [20]. We observed such a dip in GC content in 17/25 putative TAD boundary regions (at N = 10) in C. goreaui; an example is shown in Figure S2. Interestingly, our recovery of TAD-associated regions in C. goreaui is very similar to that in the chromosome-level assembly of S. microadriaticum (Figure 1f), suggesting that our revised assembly, although not derived specifically using chromosome configuration capture, e.g., in Nand et al. [20], resolves a comparable number of TAD regions.
We also identified genes that tend to disrupt the unidirectional coding of gene blocks, based on their distinct orientation from upstream and downstream genes; such disruptions have been observed in the chromosome-level genome assembly of S. microadriaticum [20]. We identified 3799 (8.4%) of such genes in C. goreaui. Interestingly, these genes largely encode functions related to transposon elements. Comparing the annotated GO terms in these genes versus those in all C. goreaui genes, the two most significantly enriched terms for Molecular Function are "nucleic acid binding" (GO:0003676; p < 1.0 × 10 −30 ) and "RNA-DNA hybrid ribonuclease activity" (GO:0004523; p = 1.8 × 10 −20 ; Table S4), and the most significantly enriched term for Biological Process is "DNA integration" (GO:0015074; p = 7.6 × 10 −7 ; see Table S4). We found no expression evidence for genes encoding transposable elements that disrupt the orientation of unidirectional gene blocks. In contrast, gene encoding transposable elements that occur in the same orientation within unidirectional blocks were actively expressed. This observation suggests that unidirectional encoding is essential for gene expression in C. goreaui, and potentially also in dinoflagellates.

Evolutionary Origin of C. goreaui Genes
The predicted genes from the improved C. goreaui genome assembly provide an excellent analysis platform to assess their evolutionary origins. To assess the overall phylogenetic signal of C. goreaui genes relative to other dinoflagellates, we inferred a dinoflagellate phylogeny (Supplementary Figure S3) using 3411 single-copy, strictly orthologous protein sets identified using 1,468,870 sequences from 30 dinoflagellate taxa including C. goreaui, plus 22,958 sequences from Perkinsus marinus as an outgroup (Table S5; see Methods). This phylogeny is congruent with the established phylogeny of dinoflagellates [23,24] with distinct strongly supported clades (bootstrap support [BS] ≥ 90%). C. goreaui is placed in a well-supported (BS 100%) clade of family Symbiodiniaceae, and within the order Suessiales (BS 100%) to which the family belongs. This result confirms the phylogenetic position of C. goreaui in the Symbiodiniaceae based on putative orthologous proteins, a result that has been demonstrated recently based on whole-genome sequence data using an alignment-free phylogenetic approach [11].
We then assessed the evolutionary origin of individual C. goreaui genes using protein data. Using 2,841,408 predicted protein sequences from 96 taxa of eukaryotes and prokaryotes (Table S5), we identified 177,346 putative homologous proteins sets based on sequence similarity (see Methods). Of the 45,322 C. goreaui proteins, 22,026 (48.6%) are specific to dinoflagellates (i.e., 3021, 8748, and 10,257 are specific to C. goreaui, order Suessiales, and Dinophyceae, respectively; Figure 2a). Dinophyceae-specific proteins are those found in one or more Dinophyceae taxa that may also include Suessiales. Although we cannot dismiss biased taxon sampling due to the scarcely available data from other alveolate taxa, these results indicate extensive lineage-specific innovation of gene functions following speciation or divergence of dinoflagellates, supporting the notion of extreme divergence of dinoflagellate genes [5,9,10,24].
We found 4601 (10.2% of 45,322) proteins to be shared exclusively with another phylum, in 1544 homologous sets (Figure 2b). Assuming that inadequate sampling is less of a concern in sets that contain a larger number of genes, we adapted the approach by Chan et al. [16] to study these putative gene-sharing partners (phylum) with dinoflagellates, based on the minimum number of genes (x) in each set, at x ≥ 2, ≥ 20, ≥ 40, and ≥ 60. At x ≥ 2, the most frequent gene-sharing partners for dinoflagellates are Chromerida, Perkinsea, and other alveolates (534), followed by Stramenopiles (195), Haptophyta (162), and Archaeplastida (146). This result supports the current phylogeny of dinoflagellates in the supergroup Alveolata, the Stramenopiles+Alveolata+Rhizaria (SAR) clade [25][26][27], and their close association with haptophytes [28] and Archaeplastida via endosymbiosis implicated by their plastid origin [29,30]. The earlier study based on transcriptome data from the bloomforming dinoflagellate Alexandrium tamarense [16] revealed a decrease in the proportion of dinoflagellate genes shared with alveolates as x increased. This trend is not observed here, e.g., the percentage of genes showing exclusive sharing with alveolates is 34.6%, 37.9%, 38.5%, and 37.3% at x ≥ 2, ≥ 20, ≥ 40, and ≥ 60, respectively. This result suggests that the phylogenetic signal observed here is more consistent than in the earlier study, boosted by the greater representation of dinoflagellate taxonomic diversity (30 taxa in Table S5). The remaining 18,695 (41.2%) C. goreaui proteins were recovered in 5795 homologous sets containing two or more phyla. To assess the evolutionary origins of these genes, we inferred a phylogenetic tree for each of these homologous protein sets; 5246 remained after passing the initial composition chi-squared test in IQ-TREE to exclude sequences for which the character composition significantly deviates from the average composition in an alignment (see Methods). Among the 5246 trees, we adopted a computational approach [31] to identify those in which Dinophyceae taxa form a strongly supported clade with one other phylum, based on observed BS at ≥ 90%, ≥ 70%, and ≥ 50% (Figure 2c; see Methods and Table S6 for detail of our tree-sorting strategy); clades observed at a higher BS threshold represent higher confidence results. All sorted trees using the three thresholds are available as Supplementary Data 1. We identified 2080, 2414, and 2605 trees that fit these criteria at BS ≥ 90%, ≥ 70%, and ≥ 50% (Figure 2c); the classification of evolutionary origin for each sorting process is shown in Table S6. The proportions of distinct putative origins for the protein sets are similar, e.g., those with putative alveolate origin are 24.9%, 24.8%, and 24.0% at BS ≥ 90%, ≥ 70%, and ≥ 50% (Figure 2c), reflecting the consistent phylogenetic signal we recovered from these data. Remarkably, the evolutionary history of dinoflagellate proteins in more than one-half of the analysed 5246 trees are too complicated to be classified using our approach, e.g., 2641 (50.3%) even at our least-stringent threshold of BS ≥ 50%. Some of these proteins (e.g., acyl-CoA dehydrogenase and GTP-binding protein of YchF family) are thought to have a shared origin with fungi or pico-prasinophytes [16], which are likely to be artefacts due to limited dinoflagellate genome data and sampling bias.

Genes Implicating a History of Horizontal Transfer
Trees containing a strongly supported monophyletic clade of dinoflagellates and taxonomically remote phyla (Categories F through O in Figure 2c) suggest a history of HGT; they account for 35.1%, 34.9%, and 36.4% of sorted trees at BS ≥ 90%, ≥ 70%, and ≥ 50%. The proportion of HGT-implicated trees is greater than that (14-17%) in the earlier study based on the transcriptome of Alexandrium tamarense [16]. In this study, a more taxonomically balanced set of eukaryote taxa was used, including a larger representation of dinoflagellates and red algae. Therefore, the biases introduced by poor taxon-sampling are diminished, as demonstrated by the consistent phylogenetic signal that we captured at different stringency levels.
Dinoflagellates possess secondary (and some tertiary) plastids independently acquired from several algal lineages through endosymbioses [30,32]. Genes from the endosymbionts were postulated to have been transferred to the host nuclear genome during this process. The implicated endosymbionts include the ancestral Archaeplastida lineages of red and/or green algae, and potentially other eukaryotic microbes, e.g., haptophytes, allowing genetic transfer between dinoflagellates and these algal lineages during these events [28]. Secondary plastids found in dinoflagellates and diatoms (stramenopiles) are thought to have arisen from an ancestral red alga [33]; both red and green algal derived genes have been described in these taxa [16,34,35].
Here, we focus on the high-confidence trees (clade recovery at BS ≥ 90%) as strong evidence for HGT. Of the 2080 trees, 402 (19.3%) putatively derived from Archaeplastida (groups F through I in Figure 2c): 73 from Rhodophyta (F), 199 from Viridiplantae (G), and 130 from any combination of Archaeplastida taxa (H and I). At the less-stringent BS threshold, this number is 589 (22.6% of 2605). As the most plausible explanation, C. goreaui (and dinoflagellate) genes in these trees likely arose via endosymbiotic genetic transfer due to one or more plastid endosymbioses, implicating ancestral Archaeplastida phyla, more evidently with green algae (in Viridiplantae) than with the red algae (Rhodophyta). Figure 3a shows the phylogenetic tree of beta-glucan synthesis-associated protein homologs, with a strongly supported (BS 97%) monophyletic clade containing Viridiplantae (i.e., the green algae Chlamydomonas, Volvox, and Chlorella), haptophytes, Stramenopiles (including diatoms), and dinoflagellates. The beta-glucans are key components of cellulose that form the thecate armour of the cell wall of dinoflagellates [36], as well as key carbohydrate storage [37]. This tree supports a putative green algal origin of the gene associated with beta-glucan synthesis in dinoflagellates and other closely related taxa, implicating an ancient HGT among these lineages. This is a more parsimonious explanation than to invoke massive gene losses in other alveolates and microbial eukaryotes.  Figure 3b shows the tree for putative abscisic acid 8 -hydroxylase, in which Viridiplantae taxa (largely plants) form a strongly supported (BS 100%) monophyletic clade with the diatom Fragilariopsis cylindrus and dinoflagellates. This tree supports a Viridiplantae origin of dinoflagellate genes, and subsequent divergence among the Suessiales (BS 98%) including C. goreaui. This enzyme is involved in regulating germination and dormancy of plant seeds via oxidation of the hormone abscisic acid [38]. It is also known as cytochrome P450 monooxygenase [39]. In dinoflagellates, this enzyme is known to regulate encystment and maintenance of dormancy [40], and the expression of this gene was found to be upregulated as an initial response to heat stress in a Cladocopium species [41]. The tree in Figure 3b shows that the protein homologs in dinoflagellates share sequence similarity to sequences in plants, whereas homologs from other green algae (the more-likely sources of HGT) are absent; given that plants are derived from green algal lineages, this result suggests that the green algal derived gene in plants, dinoflagellates, and the diatom F. cylindrus was likely subjected to differential functional divergence or gene loss among the green algae.
We also found evidence for more-recent genetic exchanges. Figure 4 shows the tree for a putative sulphate transporter, which has a strongly supported (BS 93%) monophyletic clade containing dinoflagellates and Viridiplantae (mostly green algae), separate from the usual sister lineage of Stramenopiles expected in the SAR grouping in the eukaryote tree of life [25]. This protein is involved in sulphate uptake, which in green algae has a direct impact on protein biosynthesis in the plastid [42]. This tree suggests a putative green algal origin of the genes in dinoflagellates, which implicates more-recent HGT than those observed in Figure 3. In contrast, some green algal derived genes appear to have undergone duplication upon the diversification of dinoflagellates, e.g., the tree of a hypothetical protein shown in Figure S4. Although the function of this protein in dinoflagellates remains unclear, the homolog in Arabidopsis thaliana (UniProt Q94A98; At1g65900) is localised in the chloroplast and implicated in cytokinesis and meiosis [43,44], lending support to an endosymbiotic genetic transfer associated with plastid evolution [45]. Our observation of ancient and recent genetic exchanges between green algae and dinoflagellates suggests HGT is a dynamic and ongoing process in dinoflagellate genome evolution.
Red algal origin of secondary plastids is well established [33,46]. The stronger signal of green algal than red algal origin we observed here based on a taxonomically broad and dinoflagellate-rich dataset (Table S5) lends support to the notion of an additional cryptic green algal endosymbiosis in the evolution of secondary plastids, instead of the "shopping bag" hypothesis that postulates equal proportions of acquired red or green algal genes [47]. Although green algal derived plastids in some dinoflagellates are also known [48,49], these taxa are not included in our analysis here due to a lack of genome data.
We observed a small proportion (7.1%) of trees that suggest putative genetic exchange between dinoflagellates with distantly related eukaryotes and prokaryotes, e.g., groups L through O in Figure 2c. We cannot dismiss that some of these may be artefacts due to sampling bias or even misidentified sequences in the database. For instance, the phylogeny of phosphatidylinositol 4-phosphate 5-kinase ( Figure S5) shows a strongly supported (BS 100%) clade containing 55 dinoflagellate sequences and 1 sequence from the coral Porites lutea representing Metazoa; this may be a case of misidentification of the sequence from the dinoflagellate symbiont associated with the coral.

Genes Implicating Vertical Inheritance
Among the high-confidence trees (recovery at BS ≥ 90%), 64.9% (groups A through E in Figure 2c) provide strong evidence of vertical inheritance; these trees contain a strongly supported (BS ≥ 90%) monophyletic clade containing dinoflagellates only (373), and dinoflagellates plus another closely related taxa of Alveolata (517), Stramenopiles (124), Rhizaria (65), and with the SAR group in the presence of Haptophyta and Cryptophyta (270), as expected based on our current understanding of eukaryote tree of life [25]. Figure S6 shows an example of these trees, specifically, ubiquitin carboxyl-terminal hydrolase. In this tree, all the major phyla are mostly well-resolved in strongly supported clades, e.g., Dinophyceae, Rhizaria, Stramenopiles, and Rhodophyta, each at BS 100%, and the clade of Alveolata+Rhizaria (BS 90%). Figure 5 shows another tree that contains a strongly supported (BS 100%) monophyletic clade of the SAR group, within which three clades (two supported at BS 99%, one at BS 100%), each containing similar dinoflagellate taxa, are recovered, highlighting gene-family expansion. Proteins within the three subclades putatively code for distinct functions: (a) an autophagy-related protein 18a (as with other non-dinoflagellate proteins positioned elsewhere in the tree), (b) a transmembrane protein 43, and (c) the pentatricopeptide repeat-containing protein GUN1. These distinct functions were identified, for each sub-clade, based on the annotated function of the top UniProt hit for the sequences within, and their distinct domain configurations ( Figure S7). This observation indicates vertical inheritance of the gene encoding transmembrane protein in dinoflagellates, which was then duplicated and underwent neo-or sub-functionalization to generate functional diversity. This result aligns with known features of gene-family evolution that generates functional novelty in dinoflagellates, including tandemly repeated genes that encode adaptive functions [18,50].

Conclusions
Our results demonstrate the power of long-read sequence data in elucidating key genome features in C. goreaui, at a comparable capacity to chromosome-level genome assemblies of other Symbiodiniaceae, including the resolution of duplicated genes, repetitive genomic elements, and TADs. These results support the expected high sequence and structural divergence of dinoflagellate genomes [9,10]. Comparative analysis of genes revealed clear evidence of lineage-specific innovation in C. goreaui and in dinoflagellates generally, implicating about one-half of C. goreaui genes; many (52.9%) C. goreaui genes remain dark, for which the encoded functions are unknown [24]. Our gene-by-gene phylogenetic analysis revealed the intricate evolutionary histories of genes in C. goreaui and dinoflagellates, with many too complex to be unambiguously interpreted.
Our results highlight how genetic transfer and gene duplication generated functional diversity and innovation in C. goreaui, and in combination with the conserved LTRs and DNA transposons, shaped the genome of this facultative symbiont [12]. The data generated from this study provide a useful reference for future studies of coral symbionts, and more broadly of dinoflagellates and microbial eukaryotes. The identified TAD regions, for instance, provide an excellent analysis platform to assess the presence of conserved gene-regulatory elements, e.g., promoters or enhancers of gene expression, as hypothesised in Lin et al. [22] to facilitate transcription of the unidirectional gene blocks. Our analytic workflow can be adapted and applied to study TADs in other assembled genomes of dinoflagellates.

Generation of Long-Read Genome and Transcriptome Data
Cladocopium goreaui SCF055-01 is a single-cell monoclonal culture first isolated from the coral Acropora tenuis at Magnetic Island, Queensland, Australia [51]. The cultures were maintained at the Australian Centre of Marine Science (AIMS) in Daigo's IMK medium at 26 • C, 90 µE/cm 2 /s −1 . High molecular weight genomic DNA was extracted following the SDS method described in Wilson et al. [52]. The sample was sent to the Ramaciotti Centre for Genomics (University of New South Wales, Sydney) for sequencing using the PacBio, first using RS II, then the Sequel platform (Table S1). DNA fragments of lengths 10-20 Kb were selected for the preparation of sequencing libraries. In total, 6.2 million subreads were produced (50.2 Gbp).
Total RNA was extracted from cultured SCF055 cells in the exponential growth phase (~10 6 cells), combining the standard Trizol method with the Qiagen RNeasy protocol, following the method of Rosic and Hoegh-Guldberg [53]. Quality and quantity of RNA were assessed using a Bioanalyzer and Qubit. The RNA sample was sent to the sequencing facility at the University of Queensland's Institute for Molecular Bioscience for generation of Iso-Seq data using the PacBio Sequel II platform. The Iso-Seq library was prepared using the NEBNext Single Cell/Low Input cDNA Synthesis and Amplification Module and the SMRTbell Express Template Prep Kit 2.0, following standard protocol. Sequencing was conducted using half of a Sequel II SMRT cell. From these raw data, we generated 3,534,837 circular consensus sequencing (CCS) reads (7.3 Gb; average 36 passes) using CCS v4.2.0. The CCS reads were then fed into the Iso-Seq pipeline v3.3.0 pipeline for standard Iso-Seq processing, which includes key steps of read refinement, isoform clustering, and polishing, resulting in 55,505 high-quality, non-redundant, polished Iso-Seq transcripts (total 79 Mb; N50 length 1493 bases).

De Novo Genome Assembly Combining Short-and Long-Read Sequence Data
We combined the long-read sequence data with all short-read sequence data from Liu et al. [7] in a hybrid genome assembly using MaSuRCA v3.4.2 [54]. Because the SCF055 culture is xenic, we adapted the approach in Iha et al. [55] to identify and remove putative contaminant sequences from bacterial, archaeal, and viral sources. Bowtie2 [56] was first used to map the genome sequencing data (Illumina paired-end reads) using the -very-fast algorithm to the assembled genome scaffolds to obtain information of sequencing depth. BlobTools v1.1 [57] was then used to identify anomalies of GC content and sequencing depth among the scaffolds, and to assign a taxon to each scaffold (using the default bestsum algorithm) based on hits in a BLASTn (E ≤ 10 −20 ) search against the NCBI nucleotide (nt) database (released 2020-01-08). We also mapped available transcriptome data onto the assembled genome to further assess gene structure to aid identification of intron-containing genes on the scaffolds as indication of eukaryote origin. To do this, we used our Iso-Seq transcripts (above) and the RNA-Seq data from Levin et al. [58] that we assembled using Trinity v2.9.1 [59] in de novo mode. Mapping was conducted using minimap2 v2.17-r975-dirty [60] (-secondary=no -ax splice:hq -uf -splice-flank=no), for which the code has been modified to recognise alternative splice-sites in dinoflagellate genes. Using the taxon assignment, genome coverage, and transcript support information, we identified and removed putative contaminant sequences from the genome assembly following a decision tree based on these results ( Figure S8). Chloroplast and mitochondrial genome sequences were identified and subsequently removed from the final genome assembly, following the method of Stephens et al. [18].

Estimation of Genome Size Based on Sequencing Data
To estimate the genome size, we adapted the k-mer-based approach used by González-Pech et al. [10]. We first enumerated k-mers of size k = 21 from the sequencing reads using Jellyfish v2.3.0 [61] with the command jellyfish histo -high=1,000,000. The resulting histogram of k-mer count was used as input for GenomeScope2 [62] to estimate a haploid genome size. Genomes of C. goreaui (and other Symbiodiniaceae) are thought to be haploid [7], and we did not observe bimodal distribution of k-mer coverage expected in a diploid genome ( Figure S1).

Ab Initio Prediction of Protein-Coding Genes
To predict protein-coding genes from the assembled genome sequences, we adopted the approach in Chen et al. [15], using the workflow tailored for dinoflagellate genomes (https://github.com/TimothyStephens/Dinoflagellate_Annotation_Workflow (accessed on 20 January 2021)), which was also applied in earlier studies of Symbiodiniaceae genomes [7,10].
We used four other programs for predicting protein-coding genes. The "golden genes" above were used as a training set for SNAP [66] and AUGUSTUS v3.3.1 [67] to predict genes from the repeat-masked genome; the code for AUGUSTUS was altered to recognise alternative splice sites of dinoflagellates (available at https://github.com/chancx/dinoflagalt-splice (accessed on 20 January 2021)). We also used GeneMark-ES [68] and MAKER v2.31.10 [69], for which the code was modified to recognise alternative splice sites, in protein2genome mode guided by the SwissProt database (retrieved 27 June 2018) and other predicted proteins from Symbiodiniaceae (Table S7). Finally, gene predictions from all five methods, i.e., the ab initio predictions (from GeneMark-ES, AUGUSTUS, and SNAP), MAKER protein-based predictions, and PASA transcript-based predictions, were integrated using EvidenceModeler v1.1.1 [70] to yield the gene models (see Chen et al. [15] for detail). The gene models were further polished with PASA [63] for three rounds to incorporate isoforms and UTRs, yielding the final gene models.

Functional Annotation of C. goreaui Genes
For functional annotation, all predicted proteins were searched against all protein sequences on Uniport (release 2021_03). Only hits with E ≤ 10 −5 were retained. Gene Ontology (GO) terms associated with top hits were first retrieved from the UniProt website using the Retrieve/ID mapping tool, then mapped to the corresponding queries.

Analysis of Duplicated Genes
To identify and classify duplicated genes in C. goreaui, we follow González-Pech et al. [10] to perform all-versus-all BLASTp (E ≤ 10 −5 ) on corresponding proteins of all predicted genes. The top five hits (excluding the query itself) were used as input for MCScanX [71] in duplicate_gene_classifier mode to classify genes into five categories: singleton (single-copy genes), dispersed (paralogs away from each other; i.e., at least 20 genes apart), proximal (paralogs near each other), tandem (paralogs in tandem gene block), and segmental (duplicates of collinear blocks).

GO Enrichment Analysis
R package topGO [72] was used for enrichment analysis of GO terms. In total, 21,356 genes were annotated with one or more GO terms; these were used as the background set. Genes in tandem repeats and segmental duplication were used as the test set to search for enriched GO terms, in independent analyses. Fisher's exact test was applied to assess statistical significance, and instances with p ≤ 0.01 were considered significant.

Analysis of Unidirectional Gene Blocks and TADs
For this part of analysis, we focused on five representative assembled genomes of dinoflagellates: C. goreaui from this study, C. goreaui from Chen et al. [15], Fugacium kawagutii [19], Breviolum minutum [15], and Symbiodinium microadriaticum [20]. Unidirectional gene blocks were identified based on a block of genes within which their orientations are the same. A putative TAD boundary is the region at which the orientations of two blocks converged. A putative TAD central region is the region at which the orientations diverged. We analysed putative TAD regions and unidirectional gene blocks based on the minimum number of genes within a block, N, at N = 4, 6, 8, and 10.
To validate the putative TADs, we searched for GC dip in the TAD boundaries, following Nand et al. [20]. On each scaffold, for each sliding 4Kb-window, we calculated the localised G+C content. A putative region of GC dip is identified based on three criteria: (1) the G+C in a 4Kb region is lower than average GC content of the entire scaffold (i.e., the background); (2) the largest difference between the localised G+C and the background G+C is larger than 0.05%; and that (3) the implicated region is longer than 5000 bp.

Phylogenomic Analysis of C. goreaui Genes
To investigate the putative origins of C. goreaui genes, we compiled a comprehensive protein database (2,841,408 sequences) of 96 broadly sampled taxa from diverse lineages, encompassing eukaryotic, bacterial, and archaeal taxa, of which 30 are dinoflagellates (Table S5). For species where there were multiple datasets for the same isolate, the protein sequences were clustered at 90% sequence identity using CD-HIT-v4.8.1 [65] to reduce redundancy. Isoforms are reduced to retain one representative protein (longest) per gene.
Using all 2,841,408 protein sequences from the database (Table S5), homologous protein sets were inferred using OrthoFinder v2.3.8 [73] at the default setting. For this analysis, we restricted our analysis to 177,346 putative homologous sets of C. goreaui proteins (i.e., sets in which one or more C. goreaui sequence is represented). For homologous sets that contain only Dinophyceae and one other phylum (an exclusive gene-sharing partner), the putative gene-sharing partner was assessed based on the number of implicated homologous sets, requiring at least x number of genes in each set (for x = 2, 20, 40, and 60). For the other protein sets, multiple sequence alignment was performed using MAFFT-v7.453 [74] with parameters -maxiterate 1000 -localpair. Following the methods of Stephens et al. [24], ambiguous and non-phylogenetically informative sites in each alignment were trimmed using trimAl-v1.4.1 [75] in two steps: trimming directly with -automated1, then withresoverlap 0.5 -seqoverlap 50. A maximum likelihood tree for each protein set was inferred from these trimmed alignments, using IQ-TREE2 [76], with an edge-unlinked partition model and 2000 ultrafast bootstraps. The initial step of IQ-TREE2 by default is to perform a composition chi-squared test for every sequence in an alignment, the sequence for which the character composition significantly deviates from the average composition in the alignment is removed. Alignments filtered this way were further removed if the target C. goreaui sequence was removed, and if the alignment contained no more than four sequences. In total, 5246 trees were used in subsequent analysis.

Inference of the Dinoflagellate Species Tree
To infer the dinoflagellate species tree, we first inferred homologous protein sets with OrthoFinder v2.3.8 for the 30 dinoflagellate taxa and Perkinsus marinus (outgroup) in Table S5. The 3411 strictly orthologous, single-copy protein sets (i.e., sets in which each dinoflagellate taxon is represented no more than once) were used for inferring the species tree. For each set, multiple sequence alignment was performed, and the alignment was trimmed, per our approach described above. A consensus maximum likelihood reference species tree was then inferred using IQ-TREE2, with an edge-unlinked partition model and 2000 ultrafast bootstraps.

Inference of C. goreaui Gene Origins
Putative origins of C. goreaui genes were determined based on the presence of strongly supported clades (determined at a bootstrap support threshold) that include C. goreaui (and/or other dinoflagellates) and one other taxon group (e.g., a phylum). We used PhySortR [31] to quickly sort through thousands of protein trees (i.e., we assume these as gene trees) for the specific target clades, independently at bootstrap thresholds of ≥ 90% (more stringent, higher confidence), ≥ 70%, and ≥ 50% (less stringent, lower confidence); default values were used for other parameters. Our 176-step tree-sorting strategy is detailed in Table S6. Briefly, we sorted the trees based on recovery of a strongly supported monophyletic clade containing both the subject group (dinoflagellates) and the target group in three stages: first (a) with target groups implicated in endosymbiosis in the evolution of plastid (e.g., Archaeplastida); then (b) with closely related target group expected under vertical inheritance; and finally (c) with other remotely related eukaryote or prokaryote target group as further indication of horizontal genetic transfer. At each stage, the subject group would proceed from the most inclusive (i.e., dinoflagellates plus other closely related taxa in the SAR supergroup, Cryptophyta, and Haptophyta), and progressively the most distant lineage was removed in each iterative sorting, leaving only dinoflagellates. For each subject group, the target group would proceed similarly from the most inclusive, e.g., in stage one, all three phyla of Archaeplastida, to subsequent individual phylum.
Supplementary Materials: The following supporting information can be downloaded at: https: //www.mdpi.com/article/10.3390/microorganisms10081662/s1, Data S1: All sorted trees where Dinophyceae taxa form a strongly supported clade with one other phylum; Figure S1: Genome size estimation for Cladocopium goreaui using GenomeScope v2.0, based on frequency distribution of k-mers from short-read sequence data, shown for exact (top) and log-transformed (below) values; Figure S2: An example of G+C dip observed in the C. goreaui genome of a putative boundary of topologically associated domain (TAD), shown for 4000-bp sliding windows on scaffold scf7180000355754. The x-axis shows the centre position of each sliding window along the scaffold. Of the dashed lines, the red line indicates the mean %G+C of the scaffold (as background), the green lines signify the G+C dip region, and the black lines signify a putative TAD boundary; Figure S3: Species tree of 30 dinoflagellate taxa and Perkinsus as outgroup, inferred from 3411 strictly orthologous (single-copy) protein sets; Figure S4: Maximum likelihood tree showing gene expansion of a green algal derived protein family that contain a remote homolog in Arabidopsis thaliana with function implicated in cytokinesis and meiosis; Figure S5: Maximum likelihood tree of phosphatidylinositol 4-phosphate 5kinase showing possible misidentification of the sequence from the dinoflagellate symbiont associated with the coral; Figure S6: Maximum likelihood tree of ubiquitin carboxyl-terminal hydrolase showing strong evidence of vertical inheritance; Figure S7: Domain configuration for a representative sequence from each of the three sub-clades in the tree of Figure 5, shown for the autophagy-related protein 18a, the transmembrane protein 43, and the pentatricopeptide repeat-containing protein GUN1; Figure S8: Decision tree for identification and removal of putative contaminant sequences; Table S1: PacBio long-read genome sequencing data from C. goreaui generated in this study; Table S2: Enriched GO terms for tandemly repeated genes, shown for Biological Process (BP), Cellular Component (CC), and Molecular Function (MF); Table S3: Enriched GO terms for genes in segmental duplicates, shown for Biological Process (BP), Cellular Component (CC), and Molecular Function (MF); Table S4: Enriched GO terms for genes disrupting unidirectional coding of gene blocks, shown for Biological Process (BP), Cellular Component (CC), and Molecular Function (MF); Table S5: Protein database for phylogenomic analysis; Table S6: Sorting order of phylogenetic trees and the associated classifications; Table S7: Protein sequences used to guide ab initio prediction of protein-coding genes. Data Availability Statement: Genome and transcriptome long-read sequence data generated from this study are available at NCBI Sequence Read Archive (BioProject accession PRJEB55036). The assembled genome, predicted gene models and proteins, and the identified organellar genome sequences from this study are available at https://doi.org/10.48610/fba3259 (accessed on 30 June 2022).
Acknowledgments: This project is supported by high-performance computing facilities at the National Computational Infrastructure (NCI) National Facility systems through the NCI Merit Allocation Scheme (Project d85) awarded to C.X.C., the University of Queensland Research Computing Centre, and HPC from the Australian Centre for Ecogenomics at UQ. We are grateful for the technical assistance provided by Carlos Alvarez Roa and Lesa Peplow in the extractions of DNA and RNA from C. goreaui SCF055 at the Symbiont Culturing Facility at the Australian Institute of Marine Science in Townsville, Queensland.