RNA-Seq as an Emerging Tool for Marine Dinoflagellate Transcriptome Analysis: Process and Challenges

Dinoflagellates are a large group of marine phytoplankton, studied primarily for their symbiosis with coral reefs and their ability to form harmful algal blooms (HABs). Toxins produced by dinoflagellates during HAB events have severe negative impacts on both the economy and public health. However, attempts to understand dinoflagellate genomic features are hindered by their complex genome organization. Transcriptomics has been employed to understand dinoflagellate genome structure and to profile genes and gene expression. RNA-seq is one of the latest methods for transcriptomic studies. It is capable of profiling dinoflagellate transcriptomes and offers several advantages, including high sensitivity, cost effectiveness and deeper sequence coverage. In this review, we discuss the current workflow of dinoflagellate RNA-seq, which starts with the extraction of high-quality RNA and is followed by cDNA sequencing on a next-generation sequencing platform, transcriptome assembly and computational analysis. We also highlight several considerations when using RNA-seq, such as the difficulty of dinoflagellate sequence annotation, post-transcriptional activity and the effect of RNA pooling.


Introduction
Dinoflagellates are unicellular marine eukaryotes and, besides diatoms, one of the major components of marine phytoplankton [1]. They have been studied extensively for their role in maintaining coral reefs and for their toxin production, especially during harmful algal blooms (HABs) in the ocean [2-4]. HABs caused by dinoflagellates have been associated with water discoloration and the accumulation of toxins through the marine food chain, with severe negative impacts on the economy and on health [5,6]. However, their exceptionally large genome size and permanently condensed chromosomes hinder genomic approaches to unraveling dinoflagellate genome complexity, especially with regard to toxin production [7]. Nevertheless, transcriptomics is deemed a practical alternative for genome surveys of non-model organisms such as dinoflagellates [8].
Transcriptomics refers to the study of the total RNA present in cells and its quantity, including messenger RNA (mRNA), small RNA (smRNA), ribosomal RNA (rRNA), transfer RNA (tRNA) and non-coding RNA [9]. Unlike the genome, which is relatively stable, the transcriptome changes under different environmental conditions and growth stages [10]. Transcriptomics is therefore a powerful tool for unraveling the complexity of dinoflagellate physiology and gene regulation.
Transcriptomics can identify and characterize the total RNA expressed in cells during a particular growth stage or time, quantify differentially expressed genes and determine the functional structure of genes [11]. Advances in next-generation sequencing (NGS) technologies have had a large impact on the profiling of dinoflagellate transcriptomes (Table 1). RNA-seq is a transcriptomic method that sequences the entire transcriptome of a cell at high resolution and low cost. It has been employed extensively for both eukaryotic and prokaryotic organisms [10,22-24]. In this review, we provide an overview of the process and technical aspects of dinoflagellate transcriptomics using RNA-seq and highlight the considerations and challenges of the method.

Dinoflagellate Transcriptomics: A Brief History
In the early stages of dinoflagellate transcriptomics, differential-display reverse transcription-PCR (DDRT-PCR) was used to identify over-expressed or under-expressed genes in dinoflagellates [25-27]. In general, DDRT-PCR requires a large amount of RNA, cannot accurately recognize under- or over-expressed genes and can only compare two populations at a time [28]. Because of these issues, the approach was not popular among scientists, and microarrays remained the most popular approach for transcriptome profiling for almost a decade (Table 2).

Microarray technology can detect tens of thousands of transcripts simultaneously and has led to important advances in tackling a wide range of biological problems, including the identification of genes that are differentially expressed between environmental conditions and growth phases [36,42]. Despite this, microarray technology has several limitations. For example, background hybridization limits the accuracy of expression measurements, particularly for transcripts present in low abundance. Microarrays also capture only those genes for which probes have been designed; probe design is therefore crucial, and the target sequence must be known before a probe can be designed [43].
Another method that has been applied to dinoflagellate transcriptomics is Serial Analysis of Gene Expression (SAGE). This method enables large-scale analysis of the transcriptome without prior knowledge of the genes, which makes it superior to microarrays in this respect [44]. Coyne et al. [37] developed a novel approach to facilitate the construction of SAGE libraries for dinoflagellates, which requires a lower amount of RNA, to identify genes expressed during the toxic zoospore stages of Pfiesteria shumwayae. The SAGE method did not receive much attention, however, and was soon replaced by Massively Parallel Signature Sequencing (MPSS), which offers greater sequencing depth and is more sensitive to low-abundance genes than SAGE [2,38]. MPSS has proven useful for profiling the transcriptome of Alexandrium spp. and its regulation [2,38].
Currently, RNA-seq is a new option for studying the whole-transcriptome profile of an organism. RNA-seq is the direct sequencing of transcripts using high-throughput next-generation sequencing (NGS) technologies. The approach does not depend on knowledge of the target sequence for probe selection and avoids the biases introduced during microarray hybridization [43]. RNA-seq is also capable of measuring the abundance of both low-level and high-level transcripts with greater sensitivity than the microarray method [11]. High-throughput NGS analysis plays a major role in molecular biology by measuring the activity of thousands of genes in parallel. Changes in gene expression due to different environmental conditions or growth stages can be examined through absolute quantification of transcripts. Although RNA-seq is still new in transcriptome analysis, it has considerable advantages over other transcriptome profiling methods in gene expression profiling, alternative splicing analysis, SNP discovery, and the mapping and quantification of transcriptomes [45]. We believe that RNA-seq will revolutionize dinoflagellate transcriptome analysis, as the entire transcriptome can be sequenced in a high-throughput manner at low cost, with high sensitivity and without the probe- and hybridization-related biases of earlier transcriptomic methods.

Isolation of RNA
For efficient, high-throughput transcriptome profiling, the most important step is to obtain high-quality RNA samples. The first step of RNA extraction is to release all the RNA by effectively disrupting the tough dinoflagellate cell wall without degrading the RNA in the cell (Figure 1). This step is often very challenging because dinoflagellates are known to have a thick cell wall. Commercially available lysis buffers are known to lyse dinoflagellate cells incompletely and are therefore unable to prevent RNA degradation. An additional technique to disrupt dinoflagellate cells or tissue is essential to obtain maximum yield and high-quality RNA.
Several cell disruption methods (bead beating, grinding with a micropestle and sonication) have been tested on the dinoflagellate Alexandrium catenella [46]. Grinding with a micropestle was reported to be the most efficient method for disrupting the cells. Bead beating and sonication for more than 20 s always led to RNA degradation due to the combination of RNA shearing and warming of the sample, which would activate RNases. Rosic and Hoegh-Guldberg [47] compared several RNA extraction protocols to identify the most efficient method for dinoflagellates. Interestingly, a modified procedure combining two existing commercial kits, TRIzol and the Qiagen RNeasy kit, was shown to recover high-quality RNA under specific homogenization conditions.
Next, the quality of the obtained RNA is tested on denaturing formaldehyde gels for the presence of intact 18S and 28S rRNA bands. Smearing on the gel indicates that the RNA is degraded and not suitable for sequencing [24]. More recent protocols use the Agilent Bioanalyzer (Agilent Technologies, Santa Clara, CA, USA), which defines RNA quality using the RNA Integrity Number (RIN) calculated by the instrument from the generated electropherogram; RIN values higher than 6 represent RNA with preserved integrity [24,47]. However, several studies have suggested that researchers should not blindly trust the RIN, as the presence of chloroplast rRNA from dinoflagellates may interfere with RIN calculations. The generated electropherogram should be interpreted manually: clear 18S and 28S rRNA peaks and a flat region between them indicate good-quality RNA.

Preparation of cDNA Library
Currently, the RNA-seq approach relies on converting RNA into cDNA molecules using random hexamers or oligo(dT) primers before sequencing, as RNA cannot be sequenced directly [48]. It is possible to sequence total cellular RNA without enrichment, but the sequences will likely be dominated by non-mRNA sequences, as more than 80% of the total cellular transcriptome is composed of ribosomal RNA (rRNA) [49]. Therefore, for high-throughput sequencing it is generally necessary to remove rRNA from the total RNA. The next step in the construction of a cDNA library is RNA fragmentation, which breaks long RNA molecules into short fragments of a size suited to the NGS platform. In most cases, RNA is fragmented prior to reverse transcription to cDNA, which gives more uniform coverage across the transcript and allows a more comprehensive analysis of the transcriptome, but results in depleted transcript ends [10]. When using the Roche 454 platform, however, fragmentation is done after reverse transcription to cDNA, which renders the sequencing depth lower than on the Illumina or SOLiD platforms [24].

High-Throughput Sequencing
Each cDNA fragment, with or without amplification, is then sequenced in a high-throughput manner to obtain short sequences from one end (single-end sequencing) or both ends (paired-end sequencing) using an NGS platform [11]. Current sequencing of dinoflagellate RNA relies on widely used NGS platforms such as Illumina and Roche 454 [15,50]. Each platform has its own advantages and disadvantages, so the choice should be considered carefully according to the goals of the experiment. In general, the two platforms differ substantially in read length and sequencing procedure. For instance, Roche 454 offers longer reads of up to 500 nt, which significantly reduces the effort needed to assemble transcripts unambiguously [51]. On the other hand, Illumina provides equivalent or better base-call error rates, frameshift frequencies and assembled contig lengths than Roche 454 [51]. Thus, for a low sequencing error rate and greater sequencing depth, Illumina seems to be the best choice, as this platform offers a mismatch error rate of <1% and a high sequencing depth that facilitates the detection of lowly expressed genes [51]. A detailed comparison of the technical aspects of each NGS platform for RNA-seq, including methodological procedures and sequencing chemistry, is not given here, as it has already been covered in several review articles [52-54]. Another option for RNA-seq is to use two or more platforms for high-throughput sequencing. For example, the longer reads of Roche 454 datasets can be used as a scaffold to connect the short reads of SOLiD datasets [55].

De Novo Transcriptome Assembly
Following sequencing on an NGS platform such as Illumina or Roche 454, the reads generated typically range from 50 to 330 bp, so full-length transcripts must be reconstructed using a suitable assembler program [56]. However, the reads contain artefacts originating from low-complexity sequences, adapter mismatches or the sequencing process itself, and several tools are available for trimming raw RNA-seq data [57]. The trimmed reads are then either aligned to a reference genome or reference transcripts or, where no reference exists, assembled de novo. For non-model organisms like dinoflagellates, reference genomes are largely unavailable. De novo transcriptome assembly is therefore used to produce a genome-scale transcription map that consists of the transcriptional structure and/or the level of expression of each gene [11].
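To make the trimming step concrete, the following is a minimal sketch of 3'-end quality trimming for a single read. It is not taken from any of the tools cited in [57]; the function names, the Phred+33 quality encoding, and the thresholds `min_q=20` and `min_len=25` are illustrative assumptions.

```python
def phred_scores(quality_string, offset=33):
    """Convert a FASTQ quality string to integer Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_string]


def trim_3prime(seq, qual, min_q=20, min_len=25):
    """Remove low-quality bases from the 3' end of a read.

    Returns (trimmed_seq, trimmed_qual), or None when the trimmed read
    is shorter than min_len. Thresholds are hypothetical defaults.
    """
    scores = phred_scores(qual)
    end = len(seq)
    # Walk back from the 3' end while the base quality is below the threshold.
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    if end < min_len:
        return None
    return seq[:end], qual[:end]


if __name__ == "__main__":
    # 'I' encodes Phred 40 and '#' encodes Phred 2 under Phred+33.
    print(trim_3prime("ACGTACGT", "IIIIII##", min_q=20, min_len=4))
```

Dedicated trimming tools also handle adapter removal and low-complexity filtering, which this sketch leaves out.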
Although draft genomes of coral-symbiont dinoflagellates from the genus Symbiodinium have recently been obtained and may serve as key model organisms for transcriptome assembly, the genus was found not to be densely covered by high-quality draft genome sequences [14,18,58-60]. Moreover, dinoflagellate transcriptomes exhibit complex alternative splicing patterns, and RNA-seq samples exhibit genetic polymorphism [61,62]. Thus, de novo assembly remains the best option for dinoflagellate transcriptome assembly. Several de novo transcriptome assembly programs have been used widely in dinoflagellate transcriptomics, such as Trinity, ABySS and SOAPdenovo [15,63-65]. In general, these assemblers use a De Bruijn graph-based approach, in which the reads are broken into shorter subsequences called k-mers. The overlaps between these k-mers are then used to build contigs. A more comprehensive description of how the De Bruijn graph-based approach works in de novo transcriptome assembly can be found in [66].
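The k-mer idea described above can be illustrated with a toy example. This is a simplified sketch, not the algorithm implemented in Trinity, ABySS or SOAPdenovo: it ignores sequencing errors, reverse complements and coverage, and the function names and toy reads are hypothetical.

```python
from collections import defaultdict


def kmers(read, k):
    """Return all overlapping substrings of length k in a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def build_de_bruijn(reads, k):
    """Build a De Bruijn graph: nodes are (k-1)-mers, edges are k-mers."""
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def extend_contig(graph, start):
    """Extend a contig from `start` while exactly one outgoing edge exists."""
    contig = start
    node = start
    visited = {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in visited:  # stop on a cycle
            break
        contig += nxt[-1]
        visited.add(nxt)
        node = nxt
    return contig


if __name__ == "__main__":
    # Three overlapping toy reads drawn from one short sequence.
    reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
    graph = build_de_bruijn(reads, k=4)
    print(extend_contig(graph, "ATG"))
```

The walk stops wherever a node has more than one outgoing edge. In a real transcriptome such branches can arise from alternative splicing or polymorphism, and production assemblers resolve them with additional information such as read coverage and paired-end links.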

Sequence Annotation
Complete annotation and quantification of all genes and their isoforms across samples is crucial for understanding the biology of an organism [67]. Functional annotation is the step in transcriptomic data analysis that assigns the genes present in the data to their respective functional classes, which is very useful for understanding the physiological meaning of large numbers of genes and for assessing functional differences between subgroups of sequences [68]. However, annotating each transcript as completely as possible requires a powerful computational tool; otherwise the value of the generated data will be minimal. In dinoflagellate transcriptomics, contigs are usually annotated by BLAST alignment against the NCBI non-redundant protein (NR) and nucleotide (NT), Kyoto Encyclopedia of Genes and Genomes (KEGG), SwissProt and InterPro databases [34,69,70]. Blast2GO is a powerful computational tool for high-throughput functional annotation and data mining of sequences from non-model organisms based on Gene Ontology (GO); it has a user-friendly interface and is less labor intensive [71].

Computational Analysis of Differentially Expressed Genes
The final goal of most dinoflagellate transcriptomics studies is to identify genes that are differentially expressed between biological conditions in order to better understand the molecular regulation of dinoflagellate physiology [1,34,72]. RNA-seq provides data in the form of read coverage, which is used to quantify the abundance of the transcripts present in a sample and provides additional information such as the distribution of reads for a particular transcript [67]. Methods for analysing differentially expressed genes are built on statistical models. Fisher's exact test and the likelihood ratio test are based on the assumption that the number of reads coming from a gene or transcript isoform follows a binomial or Poisson distribution [11]. Several studies have applied Fisher's exact test, using statistical packages such as DEGseq and IDEG6, to find genes that are differentially expressed in dinoflagellates under different biological conditions [2,50,63].
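As an illustration of how Fisher's exact test applies to read counts, the sketch below tests one gene in a 2x2 table: reads mapped to the gene versus all other reads, in two conditions. It is a generic standard-library implementation, not the code used by DEGseq or IDEG6; the function names and the example counts are hypothetical, and a real analysis would also correct for multiple testing across genes.

```python
import math


def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def log_hypergeom(a, row1, col1, total):
    """log P(X = a) for a 2x2 table with fixed margins."""
    return (log_comb(col1, a)
            + log_comb(total - col1, row1 - a)
            - log_comb(total, row1))


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the table [[a, b], [c, d]].

    a, c: reads mapped to the gene in conditions A and B.
    b, d: all other mapped reads in conditions A and B.
    """
    row1, col1, total = a + b, a + c, a + b + c + d
    lo = max(0, row1 + col1 - total)
    hi = min(row1, col1)
    log_obs = log_hypergeom(a, row1, col1, total)
    p = 0.0
    for x in range(lo, hi + 1):
        lp = log_hypergeom(x, row1, col1, total)
        # Sum every table at least as extreme (as unlikely) as the observed one.
        if lp <= log_obs + 1e-7:
            p += math.exp(lp)
    return min(p, 1.0)


if __name__ == "__main__":
    # Hypothetical counts: 30 vs 10 gene reads in libraries of 10,000 reads each.
    print(fisher_exact_two_sided(30, 9970, 10, 9990))
```

Working with log-probabilities keeps the calculation stable when library sizes reach millions of reads.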

Challenges and Considerations in Dinoflagellate Transcriptome Studies Using RNA-Seq
Despite the powerful potential of transcriptomics for studying gene expression in dinoflagellates, a few limitations of the approach have been reported. Transcriptome surveys in non-model species are much easier if genomic resources are available for a related species to aid assembly and functional annotation [8]. Annotation of dinoflagellate genes remains one of the biggest challenges because of the lack of genomic data for dinoflagellates [34]. Automated annotation of A. minutum contigs yielded a hit for only 28% of the contigs, and annotation of Oxyrrhis marina transcriptomic data likewise failed to identify 76% of the contigs [34,50].
Because some differentially expressed genes cannot be annotated, this is clearly a major limitation, not only for revealing the genetic basis of traits but also for elucidating mechanisms of environmental adaptation. Gibson et al. [73] found no strong correlation between low sequence or assembly quality and such un-annotated genes in eukaryotic genomes. These un-annotated genes might indicate novel pathways yet to be discovered in dinoflagellates [74]. Recently, two draft dinoflagellate genomes have been sequenced [7,58]. Approximately 42,000 protein-coding genes were detected in the S. minutum draft genome, and this work offers hope and insight for further gene identification and transcriptome assembly in dinoflagellates.
RNA turnover also needs to be considered during cell harvesting prior to RNA extraction, especially for gene expression studies in dinoflagellates. The steady-state abundance of mRNAs in the extracted RNA must represent an actual snapshot of the transcriptome under the desired condition. mRNAs act as templates for protein synthesis, and their abundance reflects the rates of transcription and degradation in the cell. A survey of mRNA stability in dinoflagellates using an in vivo labeling method revealed quite diverse mRNA half-lives in Karenia brevis, ranging from 42 min to 144 h [75]. However, a recent report demonstrated that mRNA degradation may occur rapidly, with some genes starting to degrade as early as 2 min [76]. RNA stabilization solutions such as RNAlater and RNAlater ICE seem effective at stopping dinoflagellate gene expression without compromising RNA integrity [46].
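To show why the delay between harvesting and RNA stabilization matters, the sketch below estimates the fraction of a transcript remaining after a given delay. It assumes simple first-order (exponential) decay with no new transcription, which is an illustrative assumption and not a model taken from [75] or [76]; the function name and the 10-minute delay are hypothetical.

```python
def fraction_remaining(delay_min, half_life_min):
    """Fraction of an mRNA left after `delay_min` minutes.

    Assumes first-order decay and no new transcription (illustrative only).
    """
    return 0.5 ** (delay_min / half_life_min)


if __name__ == "__main__":
    delay = 10  # hypothetical handling delay in minutes
    # Half-life extremes reported for Karenia brevis: 42 min and 144 h.
    for label, half_life in [("42 min", 42), ("144 h", 144 * 60)]:
        frac = fraction_remaining(delay, half_life)
        print(f"half-life {label}: {frac:.3f} of transcript remaining")
```

Under this assumption, short-lived transcripts are depleted far more than long-lived ones during the same delay, which distorts their relative abundances in the extracted RNA.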
Although mRNA is eventually translated into protein, the correlation between mRNA and protein abundance is poor, and the relationship is far from simple or linear [77,78]. Several post-transcriptional factors may uncouple mRNA and protein abundance, including RNA secondary structure, regulatory RNAs such as miRNA, codon bias, translational efficiency and protein turnover [78]. To date, transcriptomic approaches using either microarrays or RNA-seq record only the levels of RNA (cDNA) present and are thus unable to detect post-transcriptional events [79]. Although there has so far been no detailed study of the correlation between mRNA and protein abundance in dinoflagellates, this limitation needs to be considered, as there is strong evidence for post-transcriptional gene regulation, such as regulation by miRNA, in dinoflagellates [4,19,24,33,80-82].
Proteomics and western blotting are the two most common methods for validating protein expression in dinoflagellates and may help to detect post-transcriptional events. However, both methods are hard to apply because of the difficulty of protein identification and the limited availability of suitable antibodies for dinoflagellates [80,83]. Evans [84] also reported that genes differentially expressed between experimental conditions are not correlated with phenotypic changes in the organism. Researchers routinely target differentially expressed genes to find biomarkers for environmental stress conditions or for toxin biosynthesis. However, different types of stress may result in the same kind of cellular damage [84]. Analysis of dinoflagellate transcriptomic data has revealed overlapping responses to different types of environmental stress, for example among photosynthesis-related genes [1,42]. Thus, careful and detailed bioinformatic analysis is needed to identify genes related to a specific stress. Harvesting time for RNA extraction is crucial, especially in time-series expression experiments, because at any given time a cell expresses only a small fraction of the genes present in its genome. Multiple harvesting times are recommended to distinguish between primary and secondary responses, especially in stress studies [85].
RNA pooling is a practice used by some researchers to increase the diversity of dinoflagellate RNA represented without increasing the cost of sequencing, and it can reduce the workload in large-scale experiments [12,34,86,87]. RNA samples from multiple experimental conditions or growth stages are pooled before preparation of the cDNA library to reduce the effects of biological variation and increase the diversity of RNA. Kendziorski et al. [88] demonstrated that RNA pooling reduces only the effects of biological variation, not the biological variation itself. The practice is often desirable when the main interest of the study is to obtain broad transcriptomic data from the same individual without regard to expression level (e.g., comparison of toxic and non-toxic dinoflagellates), as the transcriptome expression level is subject to non-linear distortion during data normalization [34,88]. Another limitation is that information about biological variation cannot be traced back, and genes differentially expressed in individual RNA samples cannot be detected [89]. It is crucial to check the quality of individual RNA samples before pooling, as the quality of the transcriptomic data will be affected by a contaminated or degraded sample [86]. A comparative study of pooled RNA from toxic versus non-toxic strains of A. minutum revealed that putative candidates involved in toxin biosynthesis and regulation, or in acclimation to PSP toxins, were present only in the mRNA pools of the toxic strain, but some of these genes could not be annotated owing to the lack of known similar sequences.

Conclusions
RNA-seq is a powerful tool for dinoflagellate transcriptome studies. We have given a brief overview of the technical aspects of dinoflagellate transcriptome strategies using RNA-seq, as well as a few considerations for the method. Applying RNA-seq to dinoflagellate transcriptome analysis will advance our knowledge of gene regulation and enable the discovery of novel genes and single nucleotide polymorphisms in these enigmatic phytoplankton.

Processes 2018, 6, 5

Figure 1. Current strategies and workflow for dinoflagellate transcriptomics studies. Following RNA extraction, a cDNA library is generated and then sequenced using an NGS platform. The generated reads are then assembled and annotated before computational analysis is carried out.

Table 1. Representative publications on the application of RNA-seq in dinoflagellate transcriptome studies.

Table 2. A brief history of dinoflagellate transcriptome analysis.