Genome Sequencing and Analysis of Thraustochytriidae sp. SZU445 Provides Novel Insights into the Polyunsaturated Fatty Acid Biosynthesis Pathway

Thraustochytriidae sp. have broadly gained attention as a prospective resource for the production of omega-3 fatty acids production in significant quantities. In this study, the whole genome of Thraustochytriidae sp. SZU445, which produces high levels of docosapentaenoic acid (DPA) and docosahexaenoic acid (DHA), was sequenced and subjected to protein annotation. The obtained clean reads (63.55 Mb in total) were assembled into 54 contigs and 25 scaffolds, with maximum and minimum lengths of 400 and 0.0054 Mb, respectively. A total of 3513 genes (24.84%) were identified, which could be classified into six pathways and 44 pathway groups, of which 68 genes (1.93%) were involved in lipid metabolism. In the Gene Ontology database, 22,436 genes were annotated as cellular component (8579 genes, 38.24%), molecular function (5236 genes, 23.34%), and biological process (8621 genes, 38.42%). Four enzymes corresponding to the classic fatty acid synthase (FAS) pathway and three enzymes corresponding to the classic polyketide synthase (PKS) pathway were identified in Thraustochytriidae sp. SZU445. Although PKS pathway-associated dehydratase and isomerase enzymes were not detected in Thraustochytriidae sp. SZU445, a putative DHA- and DPA-specific fatty acid pathway was identified.


Introduction
Long-chain polyunsaturated fatty acids (LC-PUFAs), such as docosahexaenoic acid (DHA, 22:6n-3), eicosapentaenoic acid (EPA, 20:5n-3), and arachidonic acid (ARA, 20:4n-6), have gained increasing attention because of their potential health benefits to humans. These benefits have been demonstrated were 5,997,638 bp and 2,552,134 bp, respectively. The GC content of the scaffolds and contigs was 45.04%. The Poisson distributions of the GC content and GC depth analysis ( Figure 1) suggested that there was no external DNA contamination or sequencing bias, demonstrating that the genome assembly was of high quality. In addition, according to the sequence of the sequenced marine fungus, the (G − C)/(G + C) calculation method was used for GC skew analysis, and based on the results of gene distribution, ncRNA distribution, and annotation, a gene circle was constructed that indicated the distribution of each component in the genome (Figure 2). The genome of Thraustochytriidae sp. SZU445 (Table 2) contains approximately 14,145 genes, 18,768 exons, 14,145 CDSs, and 4623 introns. The total length of genes is 26,947,341 bp, with an average length of 1905.08 bp. Among all genes, those with lengths of 200-499 nt and 500-999 nt were drastically more abundant than those of other lengths ( Figure 3). Based on the assembly and annotation results, 1106 copies of the noncoding RNA were predicted, which included nearly 493 copies (0.06%) of tRNA, 235 copies (0.64%) of rRNA, and 77 copies (0.01%) of snRNA (Table 3).  The N50 length is used to determine the assembly continuity; the higher the better. N50 is a weighted median statistic that 50% of the total length is contained in transcripts that are equal to or larger than this value.  SZU445. The abscissa is the GC content, and the ordinate is the average depth. The scatter plot shows a shape that approximates a Poisson distribution and shows that sequencing data have low GC bias. SZU445. The abscissa is the GC content, and the ordinate is the average depth. The scatter plot shows a shape that approximates a Poisson distribution and shows that sequencing data have low GC bias.

Genome Sequence Annotation
All genes of Thraustochytriidae sp. SZU445 were aligned to sequences in 18 public databases, including the Carbohydrate-Active enZYmes Database (CAZY), the Transporter Classification Database (TCDB) database, Swiss-Prot, InterPro, GO, KOG/COG, and KEGG ( Table 4). The number and percentage of genes annotated by each database are presented in Table 4. Notably, 30.35% of the genes were unidentifiable, the percentage of which was similar to that observed in other species [21][22][23].

Genome Sequence Annotation
All genes of Thraustochytriidae sp. SZU445 were aligned to sequences in 18 public databases, including the Carbohydrate-Active enZYmes Database (CAZY), the Transporter Classification Database (TCDB) database, Swiss-Prot, InterPro, GO, KOG/COG, and KEGG ( Table 4). The number and percentage of genes annotated by each database are presented in Table 4. Notably, 30.35% of the genes were unidentifiable, the percentage of which was similar to that observed in other species [21][22][23].    The GO database was founded by the Gene Ontology Association in 1988 including three categories: (1) cellular component, used to describe locations, subcellular structures, and macromolecular complexes such as nucleoli, telomeres, and the recognition of the initial complex; (2) molecular function, used to describe the gene functions and gene products, such as binding to carbohydrates or ATP hydrolase activity; and (3) biological process, used to describe the ordered combination of molecular functions to achieve broader biological functions, such as mitosis [24]. Genes are assigned to one or more of these categories depending on the nature of the product. Through the GO database annotation, we speculated the possible functions of genes based on their annotations in different categories. In Thraustochytrium sp. SZU445, 22,446 genes were annotated based on the GO database ( Figure 4). The majority genes in the biological process category were involved in cellular and metabolic processes, with 3084 and 2634 genes identified, respectively. Simultaneously, only one gene participated in cell proliferation and detoxification in the cellular component category. For the cellular component category, cell (1027 genes), cell part (1027 genes), and membrane (872 genes) were the top three terms. In the molecular function category, the number of genes participating in binding (4506 genes) and catalytic activity (3305 genes) occupied the dominant position. Only one gene related to metallochaperone activity was identified in the molecular function category. According to the literature, fatty acid synthesis-related genes are primarily classified in the metabolic process group of biological process [25,26]. Thus, fatty acid anabolic processes dominated the biological processes of Thraustochytrium sp. SZU445.  KEGG is a database of large-scale molecular data generated from molecular-level informationparticularly genome sequencing and other high-throughput experimental techniques-to understand the advanced functions and utility of biological systems. The database divides biological pathways into eight categories, each of which has subdivisions. Each subdivision is annotated with an associated gene that is then graphically displayed. Through this database annotation, it is easy to KEGG is a database of large-scale molecular data generated from molecular-level information-particularly genome sequencing and other high-throughput experimental techniques-to understand the advanced functions and utility of biological systems. The database divides biological pathways into eight categories, each of which has subdivisions. Each subdivision is annotated with an associated gene that is then graphically displayed. Through this database annotation, it is easy to find genes of all annotations related to a specific function. According to the results, 3513 genes were assigned into 6 KEGG classifications (cellular processes, environmental, genetic, human diseases, metabolism, and organismal systems) in Thraustochytrium sp. SZU445 ( Figure 5). The percentages of genes involved in the 6 KEGG classifications (metabolism and organismal systems, environmental, human diseases, genetic, cellular processes) were 10.84%, 5.65%, 13.13%, 20.66%, 34.10%, and 15.74%, respectively. Among them, the dominant classification was genes involved in metabolism pathways (1198 genes). The metabolism classification can also be divided into 12 subclassifications (metabolism of amino acid, biosynthesis of other secondary, carbohydrate, energy, lipid, cofactors and vitamins, other amino acids, terpenoids and polyketides, nucleotide, global and overview maps, glycan biosynthesis and metabolism, xenobiotic biodegradation, and metabolism). Sixty-eight genes participated in lipid metabolism, which was fifth among the remaining subclassifications. This result provided the foundation for further gene analysis of fatty acid metabolism. In particular, the number of genes involved in global and overview maps (430 genes) was the largest among all subclassifications and classifications.

Phylogenetic Analysis of Thraustochytriidae sp. SZU445
Based on the 18S rRNA sequences of the sample Thraustochytriidae sp. SZU445 and the reference strains, we used the MEGA software to construct a phylogenetic tree. The neighbor-joining method was applied for the analysis. As shown in Figure 6, Thraustochytriidae sp. SZU445 was

Phylogenetic Analysis of Thraustochytriidae sp. SZU445
Based on the 18S rRNA sequences of the sample Thraustochytriidae sp. SZU445 and the reference strains, we used the MEGA software to construct a phylogenetic tree. The neighbor-joining method was applied for the analysis. As shown in Figure 6, Thraustochytriidae sp. SZU445 was shown to share a common ancestor with all reference strains. The evolutionary position of SZU445 is lower compared with the reference strains, while it is higher than Aurantiochytrium sp. HS399. In terms of kinship, Thraustochytriidae sp. SZU445 showed a higher kinship than Aurantiochytrium sp. LY-2012.
Mar. Drugs 2019, 17, x FOR PEER REVIEW 9 of 18 Figure 6. The phylogenetic tree produced using the neighbor-joining method analysis. The evolutionary history was inferred using the neighbor-joining method. The optimal tree with the sum of branch length = 0.01309240 is shown. The percentage of replicate trees in which the associated taxa clustered together in the bootstrap test (1000 replicates) is shown next to the branches. The evolutionary distances were computed using the maximum composite likelihood method and are in the units of the number of base substitutions per site. The analysis involved six nucleotide sequences. Codon positions included were 1st + 2nd + 3rd + noncoding. All ambiguous positions were removed for each sequence pair. There were a total of 1739 positions in the final dataset. Evolutionary analyses were conducted in MEGA7.

Analysis of Genes Involved in Long-Chain Fatty Acid (LCFA) Biosynthesis
We selected the genes that participated in LCFA biosynthesis from the genome sequencing and protein annotation results in the KEGG database (Table 5). There are two main ways to synthesize LCFAs: The FAS pathway and the PKS pathway.  Figure 6. The phylogenetic tree produced using the neighbor-joining method analysis. The evolutionary history was inferred using the neighbor-joining method. The optimal tree with the sum of branch length = 0.01309240 is shown. The percentage of replicate trees in which the associated taxa clustered together in the bootstrap test (1000 replicates) is shown next to the branches. The evolutionary distances were computed using the maximum composite likelihood method and are in the units of the number of base substitutions per site. The analysis involved six nucleotide sequences. Codon positions included were 1st + 2nd + 3rd + noncoding. All ambiguous positions were removed for each sequence pair. There were a total of 1739 positions in the final dataset. Evolutionary analyses were conducted in MEGA7.

Fatty Acid Synthesis by the PKS Pathway
Polyketides, the secondary metabolites that harbor multiple units of ketide groups (-CH 2 -CO-) [45], are synthesized by PKS, which is an enzyme similar to FAS in bacteria [12]. Similar to the FAS pathway, by using ACP as a covalent connection of PUFA synthesis [39,46] (COG, COG ID: COG0236), the PKS pathway proceeds with repeated cycles. The whole cycle includes condensation by using an acyl-ACP and a malonyl-ACP to form a ketoacyl-ACP with KS (KEGG, EC: 2.3.1.179), keto-reduction by converting ketoacyl-ACP to hydroxyacyl-ACP by KR (KEGG, EC: 1.1.1.-), and dehydration by removing a water molecule from hydroxyacyl-ACP. One whole cycle can produce an unsaturated enoyl-ACP, which is then reduced to a saturated acyl chain by ER (KEGG, EC: 1.3.1.9) [47]. However, unlike the FAS pathway, the PKS pathway often neglects steps of the whole cycle, e.g., dehydration and reduction. Therefore, the products of the PKS pathway vary greatly in structure and often contain keto and hydroxyl groups and double bonds. KS, KR, and ER were discovered in Thraustochytrium sp. SZU445, but an isomerase and a DH were not identified; however, Thraustochytrium sp. SZU445 can also generate PUFAs such as DHA (C22:5) [42][43][44].
The protein annotation of Thraustochytrium sp. SZU445 revealed that the DH and isomerase of the classic PKS pathway are not present in Thraustochytrium sp. SZU445. In addition, only the delta-4 and delta-9 desaturases of the classic FAS pathway were found. Moreover, the fatty acid composition of Thraustochytrium sp. SZU445 was analyzed using gas chromatography and shown to primarily include methyl tetradecanoate (C14:0), methyl hexadecanoate (C16:0), DPA, and DHA C20:4 (7,10,13,16). Based on these results, we speculated upon the possible synthesis pathway of long-chain fatty acids in Thraustochytrium sp. SZU445.

Microbes and Cultivation
Thraustochytrium sp. SZU445 was isolated from mangroves (22°31′13.044″N, 113°56′58.560″E) in the coastal waters of southern China. Because of the results of an initial assessment for DHA production, Thraustochytrium sp. SZU445 was selected for further research. Furthermore, Thraustochytrium sp. SZU445 was stored in the China General Microbiological Culture Center (CGMCC) under the accession no. 8095. First, the strain was inoculated into M4 liquid medium, which was prepared in 100% filtered natural seawater containing 0.15% peptone, 2% glucose, 0.025% KH2PO4, and 0.1% yeast extract. Seawater was gathered from Mirs Bay, Shenzhen (22°31′32.632″N,

Microbes and Cultivation
Thraustochytrium sp. SZU445 was isolated from mangroves (22 • 31 13.044" N, 113 • 56 58.560" E) in the coastal waters of southern China. Because of the results of an initial assessment for DHA production, Thraustochytrium sp. SZU445 was selected for further research. Furthermore, Thraustochytrium sp. SZU445 was stored in the China General Microbiological Culture Center (CGMCC) under the accession no. 8095. First, the strain was inoculated into M4 liquid medium, which was prepared in 100% filtered natural seawater containing 0.15% peptone, 2% glucose, 0.025% KH 2 PO 4 , and 0.1% yeast extract. Seawater was gathered from Mirs Bay, Shenzhen (22 •

DNA Preparation and Sequencing
The sequencing platform chosen for this subject was a combination of the PacBio (BGI, Shenzhen, China) and MGI2000 (MGI, Shenzhen, China) platforms. Genomic DNA was extracted from a 50 mL fresh cell suspension of Thraustochytrium sp. SZU445, which was centrifuged at 8000 rpm for 10 min. The resulting cell pellets were ground to a fine powder in liquid nitrogen, then washed once in 5.0 mL DNA extraction buffer. Subsequently, the pellets were resuspended in 5.0 mL phenol/chloroform/isoamyl alcohol (25:24:1) and centrifuged twice at 9000 rpm for 16 min. The supernatant was then washed in an equal volume of chloroform and centrifuged at 12,000 rpm for 16 min. Genomic DNA was precipitated by adding 2.5 volumes of 100% ethanol and collected by centrifugation at 12,000 rpm for 15 min at 4 • C. The genomic DNA of Thraustochytrium sp. SZU445 was sequenced on the PacBio sequencing platform based on third-generation sequencing technologies. Prior to library construction, the concentration, integrity, and purity of the genomic DNA was tested. The concentration was detected using a Qubit fluorometer (Invitrogen, Carlsbad, CA, USA). The purity and integrity of the sample was assessed by agarose gel electrophoresis (agarose gel concentration: 1.2%; voltage: 120 V; electrophoresis time: 50 min). To construct the library for the PacBio platform, 10 µg of genomic DNA was randomly fragmented by g-TUBE (Covaris, Woburn, MA, USA). The fragmented genomic DNA was then processed using an Agencourt AMPure XP-Medium kit (Beckman Coulter, Brea, CA, USA) to obtain fragments with a size of 10 kb. The DNA fragments generated from exonuclease ExoVII (New England Biolabs, Ipswich, MA, USA) digestion and subject to end repair were connected to the hairpin structure at both ends to form a dumbbell structure called SMRTbell. The library for the PacBio sequencing platform was successfully constructed after the purification and sorting of fragments. Qualified library fragments were sequenced after the annealing and binding polymerase steps. The genome sequencing data of Thraustochytriidae sp. SZU445 were uploaded to the National Center for Biotechnology Information Search database (NCBI) (BioProject ID: PRJNA579065; BioSample accession: SAMN13135800; SRA accession: PRJNA579065). The method for constructing the library using the MGI2000 platform (MGI, Shenzhen, China) is essentially the same as that described for the PacBio sequencing platform (BGI, Shenzhen, China).

Fragment Assembly and Gene Annotation
Since the raw sequencing data contained low-quality sequences, such as adaptors as well as duplicated and low-quality reads, these reads were filtered to obtain reliable subreads for assembly and guarantee the reliability of the subsequent analysis results. Genome assembly can be divided into the following four parts. (1) Subreads correction: Use Proovread (Version: 2.12; Parameter setting: -t 4 -coverage 60 -mode sr; related website: https://github.com/BioInf-Wuerzburg/proovread) to perform mixed correction on Subread to get highly reliable corrected reads; (2) corrected reads assembly: based on the corrected reads, Celera (Version: 8.3; parameter setting: doTrim_initialQualityBased = 1, doTrim_finalEvidenceBased = 1, doRemoveSpurReads = 1, doRemoveChimericReads = 1, -d properties -U; related website: http://sourceforge.net/projects/wgs-assembler/files/wgs-assembler/wgs-8.3/) and Falcon were each used to generate separate assemblies. Finally, the optimal assembly result was selected; (3) assembly result correction: single-base correction by GATK (Version: v1.6-13; parameter settings: -cluster 2 -window 5 -stand_call_conf 50 -stand_emit_conf 10.0 -dcov 200 MQ0 ≥ 4; related website: http://www.broadinstitute.org/gatk/) for assembly using second-generation small fragment data to obtain highly reliable assembly sequences; (4) scaffold and hole filling: a scaffold was constructed using SSPACE_Basic_v2.0 (Version: v2.0; related website: http://www.baseclear.com/ genomics/bioinformatics/basetools/SSPACE) on the assembly results based on the second-generation Illumina large-segment data, and pbjelly2 (Version: 15.8.24; related website: https://sourceforge.net/ projects/pb-jelly/) was used to fill holes in the scaffold. After completing the genome assembly, we first constructed the gene model for subsequent gene function annotation. We chose the de novo prediction to build the gene model. The de novo prediction uses the GeneMark-ES software (Version: 4.21, parameter settings: -genemarkes, Related website: http://exon.gatech.edu/). This software is based on Hidden Markov Model and ab initio algorithm to make predictions, and it does not require reference species and reference sequences. It is self-trained. In the assembled sequence, all possible open reading frames (ORF) were predicted as CDSs by the GeneMarkes software. After the prediction was finished, the predicted ORF was translated into the amino acid sequence. Then it was compared with 18 protein databases such as Kyoto Encyclopedia of Genes and Genomes (KEGG) database (Version: 81), Gene Ontology (GO) (Version: releases_2017-09-08), and Swiss-Prot database (Version: release-2017-07) by BLAST to obtain corresponding functional annotation information. Since each sequence alignment result exceeds one, in order to ensure its biological significance, we retain an optimal alignment result as the annotation of the gene. The Eukaryotic Cluster of Orthologous Groups (KOG) and Kyoto Encyclopedia of Genes and Genomes (KEGG) database pathway annotations were implemented using a separate BLAST search against both the KOG (http://www.ncbi.nlm.nih.gov/COG/) and KEGG (https://www.genome.jp/kegg/) databases. The Gene Ontology (GO) and Swiss-Prot annotations were also carried out using the Swiss-Prot database (https://web.expasy.org/docs/swiss-prot_guideline.html) and the GO database (https://www.ebi.ac.uk/ols/ontologies/go), respectively. In addition, the prediction and annotation of kinase domains in the Eukaryotic Protein Kinases and Protein Phosphatases Database (EKPD) were performed by HMMER (v.3.0) using the default parameters and the kinase database (http://ekpd.biocuckoo.org/faq.php). Moreover, for non-coding RNA prediction, the rRNA was predicted by RNAmmer software (Version: 1.2, parameter settings: -s Species -m Type -gff*.rRNA.gff -f*.rRNA.fq, Related website: http://www.cbs.dtu.dk/services/RNAmmer/). The tRNA region and tRNA secondary structure predicted by tRNAscan-SE software (Version: 1.3.1, parameter settings: -Spec_tag(BAOG) -o *. tRNA -f*.tRNA.structure, Related website: http://gtrnadb.ucsc.edu/). SnRNAs was predicted in comparison with Rfam database (Version: 9.1, parameter settings: -p blastn -W 7 -e 1 -v 10000 -b 10000 -m 8 -i subfile -o *.blast.m8, Related website: http://rfam.sanger.ac.uk/) by using the "covariance model" (CMS) of the Infranal software (Version: 1.1.1 (July 2014), parameter settings: default parameter, Related website: http://rfam.sanger.ac.uk/). The default parameters of Infranal software is displayed in Supplementary Materials.

Phylogenetic Analysis
The 18S rRNA was selected as the alignment sequence, and the 18S rRNA sequences of the reference strains were obtained from National Center for Biotechnology Information (NCBI). The blast function of the NCBI was used to compare the similarity of Thraustochytriidae sp. SZU445 and those of the reference strains. The NCBI accession numbers of the reference strains are shown in Table 6. The MEGA software (Version: 7.0.26; algorithm: neighbor-joining method, the maximum composite likelihood method for the evolutionary distances; related website: https://www. megasoftware.net/) was used to construct a phylogenetic tree for Thraustochytriidae sp. SZU445 and the reference strains.

Conclusions
In summary, this research revealed the essential genome features of Thraustochytriidae sp. SZU445 and proposed a putative fatty acid synthesis pathway (specifically for DHA and DPA). Our purpose was to identify metabolism-related genes involved in PUFA synthesis in Thraustochytriidae sp. SZU445. This study is the first to provide genome information relating to Thraustochytriidae sp. SZU445 using a third-generation sequencing platform. By assembling contigs and scaffolds, we predicted and analyzed the GO and KEGG terms for potential genes and proteins involved in the synthesis of polyunsaturated fatty acids. These data could yield novel insights into the molecular mechanisms of Thraustochytriidae sp. and serve in developing innovative strategies for promoting PUFA (including DHA) production.