Genome Sequence of Chrysotila roscoffensis, a Coccolithphore Contributed to Global Biogeochemical Cycles

Chrysotila is a genus of coccolithophores. Together with Emiliania, it is one of the representative genera in the Haptophyta which have been extensively studied. They are photosynthetic unicellular marine algae sharing the common characteristic of the production of CaCO3 platelets (coccoliths) on the surface of their cells and are crucial contributors to global biogeochemical cycles. Here, we report the genome assembly of Chrysotila roscoffensis. The assembled genome size was ~636 Mb distributed across 769 scaffolds with N50 of 1.63 Mb, and maximum contig length of ~2.6 Mb. Repetitive elements accounted for approximately 59% of the genome. A total of 23,341 genes were predicted from C. roscoffensis genome. The divergence time between C. roscoffensis and Emiliania huxleyi was estimated to be around 537.6 Mya. Gene families related to cytoskeleton, cellular motility and morphology, and ion transport were expanded. The genome of C. roscoffensis will provide a foundation for understanding the genetic and phenotypic diversification and calcification mechanisms of coccolithophores.


Introduction
Coccolithophores, belonging to the Haptophyta, are photosynthetic unicellular marine algae sharing the common characteristic of the production of CaCO 3 platelets (coccoliths) on the surface of their cells. They are globally distributed across all oceans except the polar ones, with some species forming blooms that can be observed from artificial satellites [1]. Coccolithophores play a fundamental role in the marine carbon cycle through the fixation of inorganic carbon by photosynthesis (the organic carbon pump) and the export of CO 2 during calcification (the carbonate counter pump) [2]. Consequentially, they are thought to be responsible for about 10% of global carbon fixation [3] and to produce up to 50% of oceanic CaCO 3 [4]. Coccolithophores also affect the global sulfur cycle through their production of dimethylsulfoniopropionate (DMSP), the major precursor of atmospheric dimethyl sulfide (DMS) [5]. In addition, coccoliths provide ballast that drives the transfer of particulate organic matter to the deep ocean [6].
Given the ecological and biogeochemical importance of coccolith formation, the mechanisms of calcification have raised a considerable interest and research. The calcification is common for a number of organisms, including unicellular organisms, invertebrates, and vertebrates, but it has unique cellular and biochemical characteristics in the coccolithophores. Firstly, the sites of calcification in the coccolithophores differ from those in other organisms.
These sites are either extracellular or intercellular for most biological calcification, while coccoliths are produced intracellularly in a Golgi-derived coccolith vesicle (CV) [7]. Secondly, the composition of the organic matrix is different between the coccolithophores and other organisms. Acidic polysaccharides, in contrast to proteins found in other organisms, such as bivalve mollusks [8], crayfish [9], pearl oyster [10], and fishotolith [11], are the main component of the organic matrix and are predominantly associated with coccolith formation in the coccolithophores [12]. Until now, the molecular mechanisms and regulators underlying characteristics of calcification in the coccolithophores are still not fully elucidated.
There are approximately 200 extant species of coccolithophores [13]; Emiliania and Chrysotila (formerly Pleurochrysis) are the two most explored genera. To our knowledge, only Emiliania huxleyi genome is available in the coccolithophores, even in the Haptophyta [14]. Moreover, these two genera exhibit high degree of genetic and phenotypic variations. For example, the gene content varied from 10% to 30% among E. huxleyi strains [15]. C. carterae (formerly Pleurochrysis carterae) calcification takes place at night, whereas E. huxleyi coccolith is mainly formed during day [16]. In Emiliania, the reticular body (RB) is closely connected to CV and is important in providing raw material for calcification [17] but appears to be absent in Chrysotila [7]. Three types of acidic polysaccharides (PS1, PS2, and PS3) were identified in Chrysotila, but Emiliania lacks PS1 and PS2, which deliver Ca 2+ to CV in Chrysotila [12]. There are very limited data on the evolution and mechanisms of these variability in the coccolithophores.
While Emiliania species distribute globally in almost all ocean ecosystems, species of genus Chrysotila was mainly found in coastal, estuarine, brackish waters and in marine aquaculture pools. Notorious foaming blooms of Chrysotila species frequently occur in these areas. Some species in genus Chrysotila were lethal to brine shrimp [18], a model organism in many toxicological research. However, the mechanism of the lethal effects is not unraveled. Non-calcified filamentous colonies in the life cycle of Chrysotila species is typical heteromorphic characteristic in this genus [19]. In the present study, we report on the assembly and annotation of the C. roscoffensis genome. The data will provide a foundation for understanding the genetic and phenotypic diversification and calcification mechanisms of coccolithophore, a key player in the global biogeochemical cycles.

C. roscoffensis Strain and DNA Extraction
Genomic DNA from C. roscoffensis (strain NMBjih026-8, Figure 1) was used for library construction and sequencing. The strain was originally isolated from coastal waters in Xiangshan Bay (N 29 •  For DNA isolation, fresh culture of motile coccolith-bearing cells was inoculated and grown in sterilized natural seawater (pH 8.30, salinity 24‰) enriched with f/2-Si culture medium, at 20 • C in light density of 60 µmol/(m 2 ·s) with a light/dark cycle of 12 h:12 h. To minimize bacterial contamination, the culture medium was supplemented with the appropriate antibiotics: 100 mg/L ampicillin, 100 mg/L kanamycin, 100 mg/L neomycin, 100 mg/L streptomycin, and 30 mg/L chloramphenicol. The algal cells were harvested at exponential phase, and total genomic DNA was extracted with the Plant DNA Kit (Tiangen, Beijing, China) under the guidance of the manufacturer's instructions. One percent of agarose gel electrophoresis and Qubit Fluorometer were used to check the quality and quantity of the isolated DNA, ensuring in the final concentration ≥ 20 ng/uL, a total amount ≥ 50 ng, no or slightly degraded, and main DNA band ≥ 5 Kb. For DNA isolation, fresh culture of motile coccolith-bearing cells was inoculated and grown in sterilized natural seawater (pH 8.30, salinity 24‰) enriched with f/2-Si culture medium, at 20 °C in light density of 60 µmol/(m 2 •s) with a light/dark cycle of 12 h:12 h. To minimize bacterial contamination, the culture medium was supplemented with the appropriate antibiotics: 100 mg/L ampicillin, 100 mg/L kanamycin, 100 mg/L neomycin, 100 mg/L streptomycin, and 30 mg/L chloramphenicol. The algal cells were harvested at exponential phase, and total genomic DNA was extracted with the Plant DNA Kit (Tiangen, Beijing, China) under the guidance of the manufacturer's instructions. One percent of agarose gel electrophoresis and Qubit Fluorometer were used to check the quality and quantity of the isolated DNA, ensuring in the final concentration ≥ 20 ng/uL, a total amount ≥ 50 ng, no or slightly degraded, and main DNA band ≥ 5 Kb.

Library Construction and Sequencing
Three libraries were constructed and sequenced. A short DNA library with an insert size of 350 bp was prepared and sequenced on Illumina Xten platform as 150 bp pairedend reads. One SMRT Bell library with an insert size of 20 kb was constructed, and the sequencing was performed on PacBio Sequel platform. The linked read sequencing library was also performed on a 10X Genomics GemCode platform.

Genome Size Estimation and De Novo Genome Assembly
Genome size and heterozygosity of C. roscoffensis were estimated by setting k-mer to 17. The 17-mer frequency distribution analysis of all clean reads from the Illumina platform was performed using SOAPdenovo [20]. PacBio reads were subjected to de novo assembly using FALCON (https://github.com/PacificBiosciences/FALCON/, accessed on 19 December 2021). First, error-correcting PacBio raw sequencing data was performed using FALCON. After correction, all reads were aligned to each other and assembled into contigs and these contig sequences were polished using Quiver algorithm. The draft assembly was corrected with Pilon [21] based on the 52.41× high-quality Illumina

Library Construction and Sequencing
Three libraries were constructed and sequenced. A short DNA library with an insert size of 350 bp was prepared and sequenced on Illumina Xten platform as 150 bp pairedend reads. One SMRT Bell library with an insert size of 20 kb was constructed, and the sequencing was performed on PacBio Sequel platform. The linked read sequencing library was also performed on a 10X Genomics GemCode platform.

Genome Size Estimation and De Novo Genome Assembly
Genome size and heterozygosity of C. roscoffensis were estimated by setting k-mer to 17. The 17-mer frequency distribution analysis of all clean reads from the Illumina platform was performed using SOAPdenovo [20]. PacBio reads were subjected to de novo assembly using FALCON (https://github.com/PacificBiosciences/FALCON/, accessed on 19 December 2021). First, error-correcting PacBio raw sequencing data was performed using FALCON. After correction, all reads were aligned to each other and assembled into contigs and these contig sequences were polished using Quiver algorithm. The draft assembly was corrected with Pilon [21] based on the 52.41× high-quality Illumina sequencing reads to collect enough corrected genome sequences. After that, the 10× Genomics data was aligned to the assembly by BWA [22] using default settings and the quality of assembly was assessed by mapping the clean short insert size reads to the scaffolds. Finally, we also evaluated the level of genome completeness of the final genome assembly using CEGMA [23].

Repetitive Sequences Annotation
Repeat sequences were identified and classified using a combination of de novo and homology-based approaches. The ab initio prediction program RepeatModeler (http:// www.repeatmasker.org/RepeatModeler.html) was employed to construct a de novo repeat library from the C. roscoffensis genomes. The homology-based annotation was performed by mapping the C. roscoffensis genomes onto Repbase database (http://www.girinst.org/) and TE protein database using RepeatMasker (http://www.repeatmasker.org/RMDownload. html) and RepeatProteinMask software [24], respectively. Tandem repeats were identified using Tandem Repeats Finder [25].
We annotated the gene functions according to the alignments to two integrated protein sequence databases (SwissProt and NR) by BLASTP with an e-value cutoff of at 1e −5 . The InterProScan [37] was adopted to search motifs and conserved functional domains using Pfam and GO databases. The pathways involved in interactions, reactions, and relationships among genes were assigned by BLAST searching the KEGG databases [38], with an E-value cutoff at 1e −5 .

Phylogenetic and Comparative Genomic Analysis
We performed comparative analysis between the C. roscoffensis genes and the genes identified from C. reinhardtii, C. eustigma, Chromochloris zofingiensis, Micromonas pusilla, Chlorella sorokiniana, Chara braunii, Thalassiosira oceanica, Thalassiosira pseudonana, P. tricornutum, Aureococcus anophagefferens, Saccharina japonica, E. huxleyi, Symbiodinium microadriaticum, Porphyra umbilicalis, Galdieria sulphuraria, Chondrus crispus, Bigelowiella natans, A. thaliana and Oryza sativa (Table S1). The genes of each species were filtered as follows: first, only the longest transcript was retained when multiple transcripts are present in one gene; second, only the genes with an encoding length longer than 50 amino acids were retained. Then, the similarity of protein sequences between pairs of all species was obtained by blastp with the e-value 1e −5 . OrthoMCL (http://orthomcl.org/orthomcl/) [39] was applied to cluster into paralogous and orthologous among 20 species protein datasets with the inflation parameter 1.5. MUSCLE [40] (http://www.drive5.com/muscle/)was adopted to align the protein sequences of each of 25 one-to-one single-copy gene families shared by all species, and all the results were combined into a super alignment matrix. Then, the 20-species phylogenetic tree was constructed using RaxML [41] (http: //sco.h-its.org/exelixis/web/software/raxml/index.html) with the maximum likelihood method, and the bootstrap was 100. B. natans was selected as the outgroup. We performed divergence dating based on the phylogenetic analysis using MCMCtree in PAML package [42,43].
The gene families that expanded and contracted in all genomes were identified using CAFÉ [44] based on phylogenetic analysis. To further functionally annotate the expanded gene families, the gene ontology (GO) term was retrieved from InterProScan results and the enrichment analysis was performed.

Genome Analysis of C. roscoffensis
Based on the total number of k-mers (26, 900,644,184), the C. roscoffensis genome size was calculated to be approximately 674.07 Mb and the heterozygosity was 0.64%, which indicated a relatively lower intraspecific variation compared to E. huxleyi [14] (Figure 2 and Table 1). To prepare for following de novo assembly, we filtered the low quality, duplicated, and adapter-containing reads generated by Illumina Xten platform to ensure high accuracy. After that, a total of 35.33 Gb (52.41-fold coverage of the genome) data were retained ( Table 2). A total of 53.12 Gb (78.80-fold coverage of the genome) PacBio sequencing data were produced for the assembly ( Table 2). The 93.22 Gb library was sequenced with 150 bp paired-end reads were generated by an Illumina HiSeq X Ten platform ( Table 2). The assembled genome size was~636 Mb distributed across 769 scaffolds ( Table 3). The final assembly result is close to the estimated genome size based on 17-mer analyses. Almost 85.30% of reads could successfully align to final assembly (Table S2). CEGMA analysis showed that 81.05% conserved core eukaryotic genes could be captured in our genome, of which 75.00% were complete (Table S3). These results indicated that the genome assembled in this paper contained comprehensive genomic information.
ting based on the phylogenetic analysis using MCMCtree in PAML package [42,43].
The gene families that expanded and contracted in all genomes were identified using CAFÉ [44] based on phylogenetic analysis. To further functionally annotate the expanded gene families, the gene ontology (GO) term was retrieved from InterProScan results and the enrichment analysis was performed.

Genome Analysis of C. roscoffensis
Based on the total number of k-mers (26,900,644,184), the C. roscoffensis genome size was calculated to be approximately 674.07 Mb and the heterozygosity was 0.64%, which indicated a relatively lower intraspecific variation compared to E. huxleyi [14] (Figure 2 and Table 1). To prepare for following de novo assembly, we filtered the low quality, duplicated, and adapter-containing reads generated by Illumina Xten platform to ensure high accuracy. After that, a total of 35.33 Gb (52.41-fold coverage of the genome) data were retained ( Table 2). A total of 53.12 Gb (78.80-fold coverage of the genome) PacBio sequencing data were produced for the assembly ( Table 2). The 93.22 Gb library was sequenced with 150 bp paired-end reads were generated by an Illumina HiSeq X Ten platform ( Table  2). The assembled genome size was ~636 Mb distributed across 769 scaffolds ( Table 3). The final assembly result is close to the estimated genome size based on 17-mer analyses. Almost 85.30% of reads could successfully align to final assembly (Table S2). CEGMA analysis showed that 81.05% conserved core eukaryotic genes could be captured in our genome, of which 75.00% were complete (Table S3). These results indicated that the genome assembled in this paper contained comprehensive genomic information. 3) based on the sequencing data from short insert size libraries and the genome size was estimated based on the formula: genome size = total_kmer_num / kmer_depth, where total_kmer_num is the total number of K-mer and kmer_depth indicates the peak position on the K-mer frequency distribution map. Heterozygous peak indicates the genome heterozygosity, repeat peak represents the repeat rate of the genome. 3) based on the sequencing data from short insert size libraries and the genome size was estimated based on the formula: genome size = total_kmer_num / kmer_depth, where total_kmer_num is the total number of K-mer and kmer_depth indicates the peak position on the K-mer frequency distribution map. Heterozygous peak indicates the genome heterozygosity, repeat peak represents the repeat rate of the genome.

Genome Annotation
The results show that 58.54% of C. roscoffensis genome consists of repetitive elements (Table 4). Among these repeats, 53.67% could be divided into known repeat families. Longterminal repeats (LTRs) were the most abundant repeat family, accounting for 37.04% of the genome size ( Table 5). The second largest family in C. roscoffensis was DNA elements, which account for 5.66% of the genome size. A total of 23,341 genes were yielded from C. roscoffensis genome and the average lengths of CDS, exon, and intron were 1596 bp, 277 bp, and 719 bp, respectively (Table 6). Finally, a total of 23,216 genes were predicted to be functional, accounting for 99.5% of all genes in C. roscoffensis genome (Table 7).

Phylogenetic and Comparative Genomic Analysis
The distribution of genes in C. roscoffensis and other 19 species was shown in Figure 3. Additionally, common and unique gene families in C. roscoffensis, E. huxleyi, S. japonica, T. oceanica, and T. pseudonana were presented in Figure 4. Phylogenetic analysis has shown that the divergence time between C. roscoffensis and E. huxleyi is estimated to be around 537.6 Mya ( Figure S1). This result suggested the divergence between C. roscoffensis and E. huxleyi was much earlier than previously predicted (approximately 250 Mya) [45].

Expanded Coccoliths-Related Gene Families
Compared with E. huxleyi, there were 22 significantly expanded gene families and 39 significantly contracted gene families were identified in C. roscoffensis ( Figure S1). There are 60 GO terms were significantly enriched among the expanded gene families (p ≤ 0.05, Table S4). Among these significantly enriched GO terms, there are 16 terms associated with cytoskeleton, cellular motility and morphology, such as 'dynein complex', 'cellular component movement', 'microtubule motor activity', 'microtubule-based movement', 'microtubule-based process', 'motor activity', 'microtubule cytoskeleton', 'microtubule associated complex', 'cytoskeletal part', 'cytoskeleton', 'anatomical structure morphogenesis', 'cilium or flagellum-dependent cell motility', 'axonemal dynein complex', 'cell morphogenesis', 'anatomical structure development', and 'non-membrane-bounded organelle'. The cytoskeleton plays fundamental roles in intracellular transport, secretion of cell wall materials, and the regulation of cell morphology in many eukaryotes [46]. In several species, the disruption of cytoskeleton prevents the secretion of coccoliths, resulting in the formation of malformed coccoliths [47,48]. The roles of cytoskeleton in calcification, such as regulating the shape of the coccolith vesicle and controlling vesicle and cell movements by interacting with the membrane trafficking system, have been proposed [5]. Thus, the significant expansion of families of genes associated with cytoskeleton in C. roscoffensis leads to a hypothesis that the calcification and morphological characteristics are associated with cytoskeleton and cellular motility.

Phylogenetic and Comparative Genomic Analysis
The distribution of genes in C. roscoffensis and other 19 species was shown in Figure  3. Additionally, common and unique gene families in C. roscoffensis, E. huxleyi, S. japonica, T. oceanica, and T. pseudonana were presented in Figure 4. Phylogenetic analysis has shown that the divergence time between C. roscoffensis and E. huxleyi is estimated to be around 537.6 Mya ( Figure S1). This result suggested the divergence between C. roscoffensis and E. huxleyi was much earlier than previously predicted (approximately 250 Mya) [45].   Here, we also identified a set of significantly enriched GO terms associated with ion transport. The coccolith is produced in a Golgi-derived CV and then is secreted to the cell surface through exocytotic pathways [5]. The calcification process presents a remarkable case of transport physiology, requiring rapid rates uptake of Ca 2+ and HCO 3 − from the surrounding seawater into the CV and meanwhile removal of the produced H + which may exert pressure on the internal pH homeostasis of the cell [49,50]. The expansion of ion transport process related genes could reflect the demand for delivery of substrates and removal of products during calcification in C. roscoffensis.

Conclusions
In conclusion, we report the genome sequencing, assembly, and annotation of the coccolithophore, C. roscoffensis. The assembled genome size was~636 Mb distributed across 769 scaffolds with N50 of 1.63 Mb, and maximum contig length of~2.6 Mb. Repetitive elements accounted for approximately 59% of the genome. A total of 23,341 genes were predicted from C. roscoffensis genome. The divergence time between C. roscoffensis and E. huxleyi was estimated to be around 537.6 Mya. Gene families related to cytoskeleton, cellular motility, and morphology and ion transport were expanded. These data are valuable genetic resource for elucidating coccolithophore biology.
Supplementary Materials: The following are available online at https://www.mdpi.com/article/10 .3390/genes13010040/s1, Figure S1: Estimation of divergence time and expansion and contraction gene families in C. roscoffensis, Table S1: Basic statistical results of C. roscoffensis and relative species, Table S2. Coverage statistics of C. roscoffensis genome, Table S3. Assessment the gene coverage rate using CEGMA and Table S4. Enriched GO terms of expanded genes in C. roscoffensis genome assembly.
Author Contributions: X.Y. and C.Z. designed the experiments and managed the project. R.M., L.Z. and P.X. prepared the materials. X.H. and Y.C. performed genome assembly and data analysis. R.M., K.L. and C.Z. mainly wrote the manuscript. Q.L. and J.X., advised and coordinated the study. All authors contributed to manuscript writing and reviewing and approved the final version for submission. All authors have read and agreed to the published version of the manuscript.

Data Availability Statement:
The raw sequencing data of the genomic and the transcriptome are available via NCBI with the BioProject accession number PRJNA648277 and BioSample accession number SAMN15644355. The assembly data have been deposited in NCBI under project accession No. SAMN15637150.

Conflicts of Interest:
The authors declare that they have no competing interest.