Genome-Wide Analysis of the RNA Helicase Gene Family in Gossypium raimondii

The RNA helicases, which help to unwind stable RNA duplexes, and have important roles in RNA metabolism, belong to a class of motor proteins that play important roles in plant development and responses to stress. Although this family of genes has been the subject of systematic investigation in Arabidopsis, rice, and tomato, it has not yet been characterized in cotton. In this study, we identified 161 putative RNA helicase genes in the genome of the diploid cotton species Gossypium raimondii. We classified these genes into three subfamilies, based on the presence of either a DEAD-box (51 genes), DEAH-box (52 genes), or DExD/H-box (58 genes) in their coding regions. Chromosome location analysis showed that the genes that encode RNA helicases are distributed across all 13 chromosomes of G. raimondii. Syntenic analysis revealed that 62 of the 161 G. raimondii helicase genes (38.5%) are within the identified syntenic blocks. Sixty-six (40.99%) helicase genes from G. raimondii have one or several putative orthologs in tomato. Additionally, GrDEADs have more conserved gene structures and more simple domains than GrDEAHs and GrDExD/Hs. Transcriptome sequencing data demonstrated that many of these helicases, especially GrDEADs, are highly expressed at the fiber initiation stage and in mature leaves. To our knowledge, this is the first report of a genome-wide analysis of the RNA helicase gene family in cotton.

genome enabled us to identify and analyze the family of RNA helicase genes in this species. Presently, there are a few reports on studies of cotton transcriptomics [22,24]. Here, we looked at expression of the different RNA helicases in transcriptome data obtained at Peking University (Beijing, China) and accessed through NCBI [25,26]. The present study identified 161 RNA helicase genes from the G. raimondii genome. These were classified into three subfamilies, which were defined by the presence of either a DEAD-box, DEAH-box, or DExD/H-box. Detailed information about the chromosomal locations, expansion, genomic structures, and phylogenetic relationships of the RNA helicase genes is provided. Transcript profiles of 161 RNA helicase genes in mature leaves and at the 0-day-post-anthesis (DPA) and 3-DPA ovule developmental stages were investigated using transcriptome-sequencing data. Our results show that most of these helicase proteins, especially GrDEADs, might function during the fiber initiation stage. This is the first report of a genome-wide analysis of the RNA helicase gene family of G. raimondii.

Identification of RNA Helicase Family Genes in G. raimondii
RNA helicase usually contains a highly conserved adenosine triphosphate (ATP)-binding domain and a classical C-terminal domain [27][28][29]. To identify all members of the RNA helicase gene family in G. raimondii, all known Arabidopsis helicase gene sequences and all identified tomato RNA helicase gene sequences were used to query the protein database of G. raimondii using BLAST (basic local alignment search tool) analysis. This identified 161 genes, of which 151 genes had apparent helicase domains with the remaining 10 genes having either incomplete or no helicase domains. Given that the annotation of the G. raimondii genome indicated that these 10 genes encode helicase proteins, they were subjected to further analysis. This enabled us to classify all 161 putative helicase genes into three subfamilies based on their phylogenetic relationships and the structural features of the motif II region. The three subfamilies were defined by the presence of the DEAD-box (51 genes), DEAH-box (52 genes), and DExD/H-box (58 genes) motifs (Table 1, File S2). Based on their subfamily and the order in which they are found on chromosomes 1 through 13, the genes were renamed from GrDEAD1 through GrDEAD51, from GrDEAH1 through GrDEAH52, and from GrDExD/H1 through GrDExD/H58 ( Table 1). The lengths, molecular weights, isoelectric points, and predicted subcellular localizations of each Gossypium raimondii helicase protein are summarized in Table 1. We found that GrDEAD genes were distinct from GrDEAH and GrDExD/H genes. Whereas the average GrDEAD protein contains 609 amino acids (aa), GrDEAH and GrDExD/H proteins each average approximately 1100 aa in length. The average theoretical isoelectric point (pI) of GrDEAD proteins, which is approximately 8.21, is higher than that for members of the two other families. Whereas, 74.5% of the helicases proteins analyzed were predicted to be located in the nucleus, 16.1% were predicted to reside in the cytoplasm, and others were mainly predicted to be located in the chloroplast and mitochondrion (Table 1).

Phylogenetic Analysis of the RNA Helicase Family Genes in G. raimondii
The phylogenetic tree of each subfamily in our study showed that the DEAD-box ( Figure 1A), DEAH-box ( Figure 1B) and DExD/H-box ( Figure 1C) subfamilies could be further classified into four, six, and six subgroups, respectively. However, the available phylogenetic analysis of RNA helicases from Arabidopsis, rice, maize, and soybean [6] places these subfamilies into many more subclades. That study classified the DEAD-box, DEAH-box and DExD/H-box RNA helicase proteins from tomato into three, three, or five large subgroups, respectively. The diversity of these subclades indicates the extent of the variation of RNA helicase genes in plants.  MUSCLE (v3.8.31) [30], and the phylogenetic tree was constructed using the maximum-likelihood (ML) method.

Chromosomal Position and Gene Duplications
The 161 RNA helicase genes of G. raimondii are distributed across all 13 chromosomes, with different densities of their distribution along different chromosomes ( Figure 2). For example, whereas chromosome 7 contained 19 RNA helicase genes, only five RNA helicase genes were annotated on chromosome 2. Based on the definition of gene clusters, a chromosomal region containing two or more genes within 200 kb can be considered to be a gene cluster [31,32]. In G. raimondii, 18 helicase genes were identified in nine clusters (GrDEAD34-35, GrDExD/H46-47, GrDEAD40-GrDExD/H39, GrDEAH10-GrDExD/H15, GrDEAH6-GrDExD/H12, GrDEAD27-GrDEAH22, GrDEAH27-GrDExD/H31, GrDEAD48-GrDExD/H52, and GrDEAD47-GrDExD/H5) that were dispersed across several chromosomes. Two pairs of genes (GrDEAD40-GrDExD/H39 and GrDEAH10-GrDExD/H15) were arranged in tandem repeats on chromosomes 5 and 10, respectively. In addition, recent studies have shown that G. raimondii has undergone the hexaploidization event (γ-WGD) shared by the eudicots and a cotton-specific whole genome duplication [25]. To analyze the relationship between the RNA helicase genes and genome-wide duplications, we mapped the G. raimondii helicase genes to the duplicated blocks. We found that 62 of the 161 helicase genes (38.5%) had syntenic relationships ( Figure 2, File S1). Of these, 40 genes involved only two chromosome regions, nine (GrDEAD4-  spanned three chromosome regions, eight (GrDEAD3-  traversed four chromosome regions, and five (GrDEAD19- 32-35-39-41) crossed five chromosome regions (Figure 2, File S1).  We also examined the orthologous relationships between helicase genes from Gossypium raimondii and tomato, given that orthologs often retain equivalent functions during the course of evolution [33]. We found that 66 (40.99%) helicase genes from Gossypium raimondii have one or several putative orthologs in tomato (Figure 3, File S1). Of these, 27, 16, and 23 were assigned to the DEAD, DEAH, and DExD/H-box subfamilies, respectively (File S1). One member in DEAD-box family, GrDEAD7, is an ortholog of SIDEAD27 in tomato and STRS1 in Arabidopsis. Whereas, GrDEAD37 is an ortholog of SIDEAD34 in tomato and LOS4 in Arabidopsis, GrDExD/H35 is an ortholog of SIDExD/H21 in tomato and ISE2 in Arabidopsis (File S1).  Figure 4 shows the exon-intron structures and conserved domains of putative RNA helicase genes in each subfamily. The number and location of introns varied among subfamilies. In general, compared with the two other subfamilies, DEAD family genes had simpler structures and more conserved structural patterns than members of the two other subfamilies. The relative levels of conservation are exemplified by comparison of the high level of conservation evident in the comparisons  Figure 4A), with the lower level of conservation amongst members of the GrDEAH-box and GrDExD/H-box helicase family genes evident in the comparisons  Figure 4B), as well as GrDExD/H5-36, GrDExD/H43-1, GrDExD/H20-48, GrDExD/H54-30-4, GrDExD/H28-3-23, and GrDExD/H38-58-33-13-44 ( Figure 4C). Gene structures within the same subgroup of all three subfamilies were also very diverse. In addition, we found that whereas genes duplicated in different parts of the genome (such as combinations GrDEAD35-  had the same or similar gene structures, genes that had been duplicated in tandem (GrDEAD40-GrDExD/H39 and GrDEAH10-GrDExD/H15) had different gene structures, especially in terms of the numbers of exons. Domain analysis indicated the presence of a highly conserved ATP-binding domain and a classical C-terminal domain in almost all of the predicted RNA helicases. Additionally, we found a Q motif in all members of the DEAD family, except for GrDEAD16. A WW domain was observed in three members of the DEAD family. We also found that many of the DEAH and DExD/H family genes were surrounded by defined folds, such as the zf-RING, dsRBDs, and HSA domains.

Expression of RNA Helicase Family Genes in G. raimondii
The full-length RNA helicase protein sequences in G. raimondii were aligned by MUSCLE (v3.8.31) [30] and analyzed using maximum likelihood (ML) method. The attribution of proteins in Figure 5 is according to the attribution of proteins in Figure 1. Bootstrap values were calculated. The abundances of the transcripts that encode selected RNA helicases was examined at the fiber initiation stage and in mature leaves. The expression patterns of 161 RNA helicase family genes are shown in Figure 5. Of the 161 predicted genes, only six genes (those that encode GrDEAD1, GrDEAD22, GrDEAD30, GrDEAD45, GrDEAH3, and GrDExD/H15) were not expressed, whereas 141 genes were expressed both in ovules and leaves. The GrDEAD genes ( Figure 5A) showed more homogenous levels of expression and an overall higher level of expression both in ovules and leaves when compared with GrDEAH genes ( Figure 5B) and GrDExD/H genes ( Figure 5C). Seven GrDEAD genes (GrDEAD13, GrDEAD19, GrDEAD21, GrDEAD28, GrDEAD32, GrDEAD35, and GrDEAD41), one GrDEAH gene (GrDEAH23), and three GrDExD/H genes (GrDExD/H33, GrDExD/H54, and GrDExD/H58) were expressed at high levels in all three samples. A member of the DEAD-box family, GrDEAD7, which is an ortholog of Arabidopsis STRS1 was highly expressed in 0-DPA ovules and mature leaves. More than half of the genes were mainly expressed in one of the development stages and tissues tested, with 50 genes (8 GrDEADs,24 GrDEAHs,and 18 GrDExD/Hs) being the most abundant in 0-DPA ovules, 25 (7 GrDEADs,5 GrDEAHs,and 13 GrDExD/Hs) being the most abundant in 3-DPA ovules, and 14 (3 GrDEADs,6 GrDEAHs, and 5 GrDExD/Hs) being the most abundant in mature leaves. For example, GrDEAD37 (an ortholog of Arabidopsis LOS4) was highly expressed in 0-DPA ovules, whereas, GrDExD/H35 (an ortholog of Arabidopsis ISE2) was expressed in mature leaves. Whereas, the similarity of the abundance profiles of transcripts encoded by duplicated genes, such as GrDEAD19, GrDEAD35, and GrDEAD41, suggested functional redundancy between these family members, the different expression patterns of members of the GrDEAD4-  indicated that other G. raimondii helicase genes have been preserved by sub-functionalization. RNA seq data (series accession number SRA048621) was obtained from the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) database [26]. The color scale at the top of the dendrogram indicates the relative expression levels. Color scale represents reads per kilobase per million normalized log2 transformed counts where light green indicates low level and red indicates high level.

0DPA
3DPA Leaf Ⅰ Ⅱ Ⅲ Ⅳ  Ⅳ Ⅴ Ⅵ three organs, and were expressed at a low level in most organs. Nonetheless, given that as members in this group had a Q motif, they may be GrDEAD genes, which were assigned to the wrong subfamily owing to inadequate classification.

Discussion
RNA helicases play important roles in plant development and responses to stress. However, only a few RNA helicases have been identified in plants. Genome-wide analysis is the first step to elucidating the biological roles of members of the RNA helicase family members in certain plant species. The recent availability of genome sequences has enabled systematic investigation of this family of genes in Arabidopsis, rice, tomato, maize, and soybean. However, no RNA helicases have been characterized in cotton. This study involved a complete analysis of the RNA helicase gene family in the G. raimondii genome, including gene classification and the analysis of chromosomal locations, gene expansion, phylogenetic relationships, and structures of the genes, as well as their expression profiles at the fiber initiation stage and in mature leaves under normal growth conditions.

RNA Helicases in G. raimondii
We identified 161 RNA helicase genes in the diploid genome of the cotton species G. raimondii (Table 1). This is close to the estimated number of RNA helicase predicted from the analysis of the tomato (157), Arabidopsis (161), rice (149), maize (136), and soybean (213) genomes. The relatively large number of cotton RNA helicase genes identified likely reflects the detailed method we used to identify the RNA helicase genes. The presence of a large RNA helicase gene family in all of these species underscores that RNA helicases likely play important regulatory roles in various processes during plant growth and development. We classified these helicases into three subfamilies, which include the DEAD-box (51 genes), DEAH-box (52 genes), and DExD/H-box (58 genes) gene families (Table 1). Whereas, each of the three subfamilies of cotton RNA helicase genes consists of a similar number of genes, Xu et al. [6] have shown that the DEAD-box and DExD/H-box subfamily were larger than DEAH-box subfamily in Arabidopsis, rice, maize, and soybean. We also analyzed the predicted lengths, molecular weights, isoelectric points, and subcellular localizations of each putative helicase protein identified in the G. raimondii genome. We found that DEAD-box RNA helicases were distinct from DEAH-box and DExD/H-box RNA helicases. The apparently higher isoelectric points and smaller sizes of DEAD helicase proteins might be related to their relatively simple gene structures compared with other classes of RNA helicase. In addition, most of DEAD-box and DExD/H-box RNA helicase proteins were predicted to be located in the nucleus and cytoplasm while most DEAH-box RNA helicase proteins were predicted to reside in the nucleus (Table 1). Thus, we suggested DEAH-box RNA helicase proteins may mainly function in nuclear RNA processing. Linder et al. [34] have reported two DEAH-box RNA helicases, ESP3 and MUT6 which were located in nucleus perform different roles-RNA splicing and decay-in nuclear RNA processing. Several DEAD-box RNA helicase proteins and DExD/H-box RNA helicase proteins were predicted to be located in the chloroplast and mitochondria. Recently, there have been a few reports about RNA helicase proteins in chloroplast or mitochondria in plant. Asakura et al. [35] demonstrated that chloroplast RH3 DEAD-box RNA helicases in maize and Arabidopsis function in splicing of specific group II introns and affect chloroplast ribosome biogenesis. He et al. [36] demonstrated mitochondria ABO6 DExH-box RNA helicase in Arabidopsis involves in regulating the splicing of several genes of complex I.

Expansion of the RNA Helicase Gene Family in G. raimondii
Recent studies have shown that the G. raimondii genome has undergone at least two rounds of genome-wide duplication [25]. To detect possible relationships between RNA helicase genes and genome duplication events, we mapped 36 paralogous gene pairs (38.5%) of RNA helicase genes in G. raimondii (Figure 2, File S1). A similar percentage of paralogous pairs of RNA helicase genes was observed in Arabidopsis, rice, maize and soybean (35, 27, 25, and 62 pairs, respectively) [6]. These results suggest that the expansion of RNA helicase gene family is associated with whole-genome duplication events. However, the 38.5% value is much lower than that of the family of genes that encode NAC (NAM/ATAF/CUC) transcription factors (NAC-TFs) in G. raimondii. Of the 127 G. raimondii NAC-TF genes, 76.37% were within 307 identified syntenic blocks [37]. Given that segmental duplication events occur more often in the more slowly evolving gene families [38], the RNA helicase gene family may be evolving more rapidly than most other gene families in the cotton genome. In addition to the whole-genome duplication event, gene families can also arise through tandem amplification. For instance, in Chinese plum, tandem duplications played a key role in the expansion of the AP2/ERF family [39]. In G. raimondii, Shang et al. [37] also detected 20 tandem duplications associated with NAC-TF genes. However, only two gene pairs in our study showed evidence of having participated in tandem duplication. The mechanisms that supported the expansion of the RNA helicase gene family may be more complicated than is suggested by our classification; the specific mechanisms involved require further investigation. Marchat et al. [27] reported that the evolution of RNA helicase proteins involved gene fusion. Moreover, we speculate that given that the majority of RNA helicase gene family members play vital and diverse role in plants, there would be strong selection against variation in their copy numbers. The DEAD genes showed the greatest degree of duplication. Given the more simple and conserved gene structures and domains, as well as higher expression of members of this subfamily than those of others, we propose that DEAD helicase genes may evolve more slowly and may play more basic roles in plant growth and development than the members of other subfamilies of RNA helicases.

Phylogenetic Analysis and Gene Structural Organization
The full-length RNA helicase protein sequences in tomato and G. raimondii were aligned by MUSCLE and analyzed using the more accurate maximum likelihood (ML) method (File S2). Members of the DEAD-box, DEAH-box, and DExD/H-box subfamilies were further classified into four, six, and six subgroups, respectively (Figure 1), although Xu et al. [6] placed them into many more subclades following phylogenetic analysis of the RNA helicases of Arabidopsis, rice, maize, and soybean. Those workers classified the DEAD-box, DEAH-box, and DExD/H-box RNA helicase proteins from tomato into three, three, or five large subgroups, respectively. The diversity in the number and compositions of the subclades from different species indicate variation in the compositions of different RNA helicase gene families from different plant species. In addition, analysis of the exon-intron structures and sequences of the conserved domains can provide insights into the evolution of gene families. Our results showed that the numbers and locations of introns varied among subfamilies. Members of the DEAD-box subfamily had comparatively simple and conserved structural patterns. Genes from the GrDEAH-box and GrDExD/H-box subfamilies were less conserved and more diverse than those from the DEAD-box subfamily. It is noteworthy that genome-wide duplicated genes had the similar or same gene structures, whereas, tandem duplicated genes did not; this requires further analysis. Domain analysis indicated that an ATP-binding domain and a C-terminal domain were highly conserved in these putative RNA helicase proteins, whereas, more diversity was evident amongst the other domains present in each subfamily. The most characteristic feature of the DEAD-box family is the conserved Q motif [40]. We found Q motifs in all members of DEAD genes, except for GrDEAD16. In addition, three of the family members were found to have a WW domain. The WW domain has been implicated in mediating protein-protein interactions and linking cell signaling to the membrane cytoskeleton [41]. Whereas, DEAH genes lack a Q motif, most members of group II of DExD/H family have a Q motif. This might be attributed to the not-very-strict classification criteria used to distinguish between members of the DEAD and DExD/H subfamily. Many of the DEAH and DExD/H family genes were surrounded by defined folds, such as the zf-RING, dsRBD, and HAS domains, which may extend the length of the helicase. These regions influence or even define the function of a helicase [42]. The great diversity in these helicases may allow them to regulate many specific pathways in plants.

Expression Analysis Based on Transcriptome Sequencing Data
RNA helicases rearrange RNA secondary structure, potentially playing roles in any cellular process that involves RNA metabolism [3]. Of the 161 predicted genes in our study, 141 (87.6%) RNA helicase genes were expressed both in ovules and leaves. Xu et al. [6] reported that more than 80% RNA helicase genes in Arabidopsis, rice, and maize were expressed in at least one of the development stages and tissues tested. The high expression level of this RNA helicase gene family in all of these species further indicates that the RNA helicases may play important roles in various processes. The GrDEAD genes were more homogenous in terms of their level of expression and were expressed at higher levels both in ovules and leaves when compared with GrDEAH genes and GrDExD/H genes ( Figure 5). The DEAD-box family member GrDEAD7, which is an ortholog of Arabidopsis STRS1, was highly expressed in 0-DPA ovules and mature leaves. The RNA helicases STRS1 and STRS2 have been shown to be involved in responses to various abiotic stresses [9]. GrDEAD genes may be more important in diverse cellular processes than GrDEAH genes and GrDExD/H genes. However, Xu et al. [6] have shown the DEAH-box RNA helicase genes higher proportion of the development stages and tissues in Arabidopsis, rice, and maize than DEAD-box RNA helicase genes and DExD/H-box RNA helicase genes. This phenomenon might be attributed to diversity between the different crops investigated or the diversity of the development stages and tissues examined. Moreover, tissue specificity of the expression of genes is commonly observed in plants. For example, cyclin dependent kinases-like proteins (CKL) of Arabidopsis show strong tissue specificity of expression, with CKL12 being root specific, CKL3 induced in stem tumors and callus, and CKL6 showing strong expression only in leaf tissue [43]. Though Arabidopsis ISE2 has been shown to be involved in posttranscriptional gene silencing, and the absence of the ISE2 affects a critical factor required for correct plasmodesmata (PD) formation and function [16], the duration of PD closure was positively correlated with the final fiber length attained [44]. Thus, GrDExD/H35 is not expressed at the initiation fiber stage.

Identification of RNA Helicase Genes
The latest version (V2.0) of the Gossypium raimondii genome and protein sequences was downloaded from CottonGen [45]. To identify the members of the RNA helicase gene family in Gossypium raimondii, all known Arabidopsis helicase gene sequences and the identified tomato RNA helicase gene sequences [5] were used as queries to perform multiple database searches using BLASTP [46]. They were downloaded from The Arabidopsis Information Resource (TAIR) [47] and the plantGDB database [48], respectively. After selection of G. raimondii proteins with at least 50% identity with the query sequence, the candidate helicases proteins were aligned with each other to ensure that no gene was represented multiple times. All of the remaining protein sequences were examined using the domain analysis program PROSITE [49] with the default cutoff parameters. The three fields (length, molecular weight, and isoelectric point) of each Gossypium raimondii helicase protein were calculated using the online ExPasy program [50]. Subcellular localization was analyzed using the CELLO v2.5 server [51].

Phylogenetic Analysis
The full-length protein sequences of the helicase genes were aligned using the MUSCLE (v3.8.31) [30] program with the default settings. Phylogenetic trees were constructed by employing the maximum-likelihood (ML) method of the phyML (20120412) program [52] with the WAG (Whelan and Goldman) substitution model. Bootstrap values were calculated using the aLRT (approximate likelihood ratio test) model with the default cutoff parameters.

Chromosome Localization and Gene Duplications
Synteny analysis was conducted locally using a method similar to that developed for the Plant Genome Duplication Database [53]. Mcscan [54] was employed to identify homologous regions, and syntenic blocks were evaluated using Circos-0.64 [55]. Default parameters were used in all steps. Tandem duplication was characterized as multiple genes of one family located within the same or neighboring intergenic region [39].

Gene Structure and Domain Analysis
All of the putative protein sequences were analyzed using the domain analysis program PROSITE [56] at ExPASy [57]. Exon-intron structure information of these helicase genes was parsed from the GFF (Generic Feature Format) file downloaded along with the genomic data. Gene structures of the helicase genes were generated using the GSDS (Gene Structure Display Server) algorithm [58].

Gene Expression Analyses
The expression pattern of the helicase genes was analyzed using transcriptome sequencing data from mature leaves, 0-DPA ovules, and 3-DPA ovules of G. raimondii. These data were obtained from the NCBI Sequence Read Archive (SRA) [25]. The accession numbers were: SRX111367, SRX111365, and SRX111366, respectively. The search was performed using nucleotide signatures at least 20 nucleotide long. Reads mapping were performed by BWA (0.7.5a-r405) [59] with the default parameters except that the seed length was set to be 31. Sequenced reads that were mapped on these helicase sequences were converted to RPKM in order to estimate gene expression levels [60,61]. The formula used was: where C is the number of reads that were uniquely aligned to the transcript, N is the total number of reads that were uniquely aligned to all the transcripts in a specific sample and L is number of bases in the transcript.

Conclusions
Our study has reported a genome-wide analysis of the important RNA helicase gene family in G. raimondii. Based on their expression analysis, we hypothesize that GrDEAD37 and GrDExD/H35 might function during the fiber initiation stage. Only 38.5% of the putative RNA helicase genes were mapped to the previously identified syntenic blocks. The specific mechanisms used during the expansion of the RNA helicase family might be more complicated than suggested by our analysis, and require further investigation. Of the subfamilies of RNA helicases from G. raimondii, the GrDEADs have undergone the greatest degree of duplication and have the most conserved structural patterns and highest levels of expression, when measured at the level of transcript abundance. This suggests that the GrDEAD gene subfamily may evolve more slowly than the two other subfamilies, and that GrDEAD genes play a more important and basic role in cotton than DEAH or DExD/H genes. This study should provide a solid foundation for future functional studies and for guiding future experimental work on helicase genes in plants.