Genome-Wide Identification and Evolution of HECT Genes in Soybean

Proteins containing domains homologous to the E6-associated protein (E6-AP) carboxyl terminus (HECT) are an important class of E3 ubiquitin ligases involved in the ubiquitin proteasome pathway. HECT-type E3s play crucial roles in plant growth and development. However, current understanding of plant HECT genes and their evolution is very limited. In this study, we performed a genome-wide analysis of the HECT domain-containing genes in soybean. Using high-quality genome sequences, we identified 19 soybean HECT genes. The predicted HECT genes were distributed unevenly across 15 of 20 chromosomes. Nineteen of these genes were inferred to be segmentally duplicated gene pairs, suggesting that in soybean, segmental duplications have made a significant contribution to the expansion of the HECT gene family. Phylogenetic analysis showed that these HECT genes can be divided into seven groups, among which gene structure and domain architecture was relatively well-conserved. The Ka/Ks ratios show that after the duplication events, duplicated HECT genes underwent purifying selection. Moreover, expression analysis reveals that 15 of the HECT genes in soybean are differentially expressed in 14 tissues, and are often highly expressed in the flowers and roots. In summary, this work provides useful information on which further functional studies of soybean HECT genes can be based.

The HECT ubiquitin ligase is an important class of E3 enzymes. HECT E3s are single polypeptides characterized by the presence of a C-terminal 350-amino acid-length HECT domain. The common features of HECT E3s are the C-terminal catalytic HECT domain, and the N-terminal domains, which recruit specific substrates for ubiquitin ligation [7,12]. The C-terminal HECT domain includes two essential binding sites: a ubiquitin-binding site, and an E2-binding site [7,12]. It also includes two sub-structures: the C-lobe, which receives ubiquitin from E2 and links itself with ubiquitin, and the N-lobe [21]. Classification of a particular HECT E3 protein into one of the different subfamilies is based on the arrangement of the N-terminal domains [7,22,23]. These two modular architectures, the N-terminal substrate-binding domains and the C-terminal HECT domain, govern the polypeptides' interactions with various substrates, as well as their regulatory functions. Substrates often contain recognition sequences, which can bind directly to the N-terminal substrate-binding domains [21,[24][25][26][27]. The unique HECT domains are crucial to the identification and evolution of the HECT genes in plant genomes, and merit intensive research.
As the smallest E3 subfamily, HECT comprises seven genes (named UPL1-UPL7), which have been identified in Arabidopsis thaliana [7]. Recently, 413 plant sequences containing the HECT domain were identified via TBlastN analysis, which compared multiple HECT sequences to entries in the NCBI database [22]. However, due to the lack of corresponding data from other genomes, the process of identifying HECT genes in other plant species is not complete. Although a genomic survey of eukaryote HECT ubiquitin ligases was performed, the number plant of species included in the research was limited [23]. The plant species with fully analyzed HECT genes is Arabidopsis thaliana [3,6,7]. In this study, we performed a genome-wide analysis of the HECT domain-containing genes in soybean, ultimately identifying 19 HECT genes. We also performed a comprehensive phylogenetic analysis of 365 HECT genes from 41 plant species. These 365 HECT genes included the 19 soybean HECT genes and a subset of HECT genes from four plant species, including Arabidopsis thaliana, Glycine max, Medicago truncatula, and Phaseolus vulgaris. A detailed analysis of gene structure, domain architecture, chromosome location, duplication pattern, and expression pattern was performed. It is interesting to note that all 19 soybean HECT genes are located in the duplicated blocks of the genome, which suggests that segmental duplications have made crucial contributions to the expansion of HECT genes in this plant species. Moreover, we used the RNA-seq expression profiles of 14 soybean tissues to study the expression patterns of the different HECT genes. Our work provides information that is useful for further investigation of the various functions of the HECT gene family in soybean.

Phylogenetic Analysis of HECT Genes in Soybean
To determine the nature of the evolutionary relationship between soybean HECT genes and those of other plant species, we performed multiple sequence alignments, and constructed a maximum likelihood phylogenetic tree for the 365 plant HECT proteins of the 41 plant species in Phytozome v9.1, including the 19 soybean HECT genes. The conserved HECT domain sequences (File S1) (about 350 amino acids in length) were used in the analysis, because of the different lengths and various domain architectures of the HECT proteins. Three hundred and sixty-five plant HECT genes from Viridiplantae can be classified into seven groups (Group I-VII), with the exception of some genes from the lower land plants (Figures 1 and S2). These seven groups can be further grouped into five subfamilies corresponding to those described in a previous study [22]. To further examine the evolutionary characteristics of soybean HECT genes, the phylogenetic relationships of the full-length HECT proteins of Glycine max, Medicago truncatula, Phaseolus vulgaris, and Arabidopsis thaliana (outgroup) were analyzed. As shown in Figure 2, Arabidopsis HECT genes are consistently separated from those of other species. The 19 soybean HECT genes can also be subdivided into these seven groups (Figures 2-4). In soybean, groups I, III, V, and VII each contain two genes, groups II and VI each contain four genes, and group IV contains three genes. However, in Arabidopsis thaliana, groups III-VII each contain only one gene, Group I contains two genes as in soybean, and Group II does not contain any HECT genes.

Figure 2.
Neighbor-joining (NJ) tree of HECT genes from Glycine max, Medicago truncatula, Phaseolus vulgaris, and Arabidopsis thaliana. MEGA6 package was used to construct the NJ tree from the full-length amino acid sequence alignments (File S2) of the four plant species, with 1000 bootstrap replicates. Numbers refer to bootstrap support (in terms of percentage).

Domain Architecture and Exon-Intron Structure of the Soybean HECT Genes
To better understand the structural diversity of HECT genes, the exon-intron structures of the soybean HECT genomic sequences, and the domain architectures of the soybean HECT proteins were compared, according to their phylogenetic relationships. Each gene structure was obtained by comparing its coding sequences to its genomic sequences. As shown in Figure 3, closely related HECT genes were generally more similar in gene structure, particularly with respect to exon and intron number, and differed mainly in their respective exon and intron lengths. The domain architecture of HECT proteins was analyzed using the InterProScan program with a six-database annotation. A total of nine domains were identified ( Figure 4). In addition to the HECT domain, soybean HECT proteins contain additional domains in the N-terminal regions, which are assumed to be responsible for governing interactions with various substrates [7].

Chromosome Location and Duplication of Soybean HECT Genes
To determine the genomic locations of the HECT genes, the 19 soybean HECT genes were mapped on the 20 chromosomes in the soybean sequence database in Phytozome v9.1. The soybean HECT genes are randomly located on 15 of 20 chromosomes: chromosomes 1, 9, 16, 18, and 20 contain no HECT genes, chromosomes 4, 6, 7, and 17 each contain two HECT genes, while the other chromosomes each contain only one HECT gene ( Figure 5). Segmental and tandem duplication are the two primary phenomena causing gene family expansion in plants [29,30]. Additionally, in order to examine the duplication patterns of the soybean HECT genes, we identified tandem duplications based on the gene loci, and searched the Plant Genome Duplication Database (PGDD) [31] to locate segmentally duplicated pairs. No tandem duplicated pairs were detected in the 19 soybean HECT genes. However, all 19 HECT genes were found to have been involved in segmental duplication ( Figure 5). To date the duplication time of these segmentally duplicated HECT genes, we estimated the synonymous (Ks) and nonsynonymous substitution (Ka) distance, as well as the Ka/Ks ratios. The ratio of Ka/Ks for each segmentally duplicated gene pair varied from 0.13 to 0.44, with an average of 0.23 (Table 2). This analysis suggests that the duplicated HECT genes are under strong negative selection, as their Ka/Ks ratios were estimated to be <1. The approximate date of each duplication event was calculated using Ks (Table 2). We found that in each group, the two closest leaves of the soybean HECT gene phylogeny duplicated about 5-12 Mya, while the others duplicated about 32-46 Mya.

Conserved Residues in the HECT Domain
Despite the lack of information concerning the three-dimensional structure of genes in the plant HECT domain, their architectures have been described by studies of the crystal structure of the HECT domain of human HECT Nedd4 [21,25]. This makes it possible to investigate the structure and function of plant HECT domains.
We used WebLogo3 [32] to visualize the conserved residues in the HECT domain, and found that both the N-lobe and C-lobe of the HECT domain contain critical conserved residues ( Figure 6A). In addition, in order to describe these conserved residues in the context of the three-dimensional structure, we aligned the 365 HECT domain sequences with the downloaded HECT domain structure 4BBN chain A [21]. There is an abundance of conserved residues in the 365 plant HECT domain sequences (see Figure 6B, conserved residues shown in blue). In particular, almost half of the sites near the highly conserved catalytic C at site 319 in the C-lobe are highly conserved (L313, P314, T318, C319, N321, L323, L325, P326, and Y328) (for convenience, the first residue of the HECT domain is designed as site 1). Furthermore, domain logo results for the 7 HECT gene groups of 41 plant species show that in each group, almost all residues are highly conserved ( Figure S3).  Figure S3) represent the amount of informational content at each sequence position; Note that in the 3D representations (B), green represents ubiquitin (Ub), and the similarity values are mapped to a color gradient from low (red) to high rate of conservation (blue).

Expression Patterns of Soybean HECT Genes
To explore the expression patterns of these soybean HECT genes, we used RNA-seq data from SoySeq [33]. Based on the soybean RNA-seq data, 15 HECT genes were detected in all 14 tissues at the gene level ( Figure 7 and Table S3). This suggests that most HECT genes are broadly expressed during soybean development. Most HECT genes in the flowers and roots were relatively highly expressed, while those in the pod shell and seed were relatively lowly expressed ( Figure 7A). In addition, genes within each group or in different groups often had similar expression patterns in different tissues, as was the case with the expression of group II (Glyma02g38020, Glyma06g10360, Glyma14g36180) and group VI (Glyma04g00530, Glyma11g11490, Glyma12g03640) ( Figure 7A). However, unlike other genes, two genes-Glyma17g01210 in group III and Glyma06g00600 in group VI-were relatively highly expressed in the nodules than other tissues ( Figure 7A). For each tissue, the group VI HECT genes (Glyma04g00530, Glyma06g00600, Glyma11g11490 and Glyma 12g03640) were almost relatively highly expressed for all samples (except nodule) ( Figure 7B). In nodule, the Glyma17g01210 in group III had a relatively higher expression than other HECT genes ( Figure 7B).

Figure 7.
Heatmap of expression profiles of soybean HECT genes in 14 tissues. Normalized transcriptional levels were obtained from Severin et al. [33]. The RNA-seq relative expression data of 14 tissues was used to reconstruct the expression patterns of soybean genes. Color in the heatmaps represents Z-score of RPKM values of soybean HECT genes calculated per row (gene) (A) and per column (tissue) (B), respectively. Z-scores calculated per row (A) were used to show the changes of expression of a gene across different tissues, and Z-scores calculated per column (B) were used to rank genes for each sample. The sources of the samples are as follows: young leaf, flower, one cm pod, pod shell 10DAF (day after flowering), pod shell 14DAF, seed 10DAF, seed 14DAF, seed 21DAF, seed 25DAF, seed 28DAF, seed 35DAF, seed 42DAF, root, and nodule.

Discussion
Arabidopsis thaliana HECT family genes play crucial roles in various plant developmental and physiological processes [3,6,7,34], including trichome development [7], genome endoreduplication [6], and leaf senescence [3]. However, this gene family has not been studied in soybean. In this study, we performed a comprehensive analysis of the soybean HECT gene family, including an analysis of the genes' phylogeny, gene structure, domain architecture, chromosome location, duplication patterns, conserved residues, and expression profiles.
In this study, 19 HECT genes were identified in the soybean genome, which is 2.7 times the number present in Arabidopsis thaliana. However, a recent study found there to be 15 HECT genes in soybean [22]. Our results revealed that there are four more HECT genes (group I: Glyma05g26360, group II: Glyma06g10360, Glyma14g36180, and group V: Glyma19g37310) in soybean genome than previously estimated ( Figure S4). There are two possible explanations for this discrepancy. First, the latest update of the soybean genome database includes a number of newly assembled and imported genes. Second, the search methods implemented in our study and differed from those used in the previous study. We used the combined method of HMMER-Blast-InterProScan, while the previous study used TBlastN.
The results of the phylogenetic analysis of 365 plant HECT genes from 41 plant species divided the soybean HECT genes into subfamilies similar to those described in previous reports [7,22,23]. The divisions were based on corresponding HECT domain sequence homology. According to the phylogenetic relationships between the HECT genes in Glycine max, Medicago truncatula, Phaseolus vulgaris, and Arabidopsis thaliana (outgroup), soybean HECT genes can be divided into seven groups. Compared with previous study [22], subfamily IV HECT genes were absent in these plants. Subfamily V (3 genes) corresponds to group I (2 genes) and II (4 genes) and subfamily I (6 genes) corresponds to group VI (4 genes) and VII (2 genes) in this study. Subfamily II (1 gene) corresponds to group V (2 genes), subfamily III (3 genes) corresponds to group IV (3 genes), and subfamily VI (2 genes) corresponds to group III (2 genes). Except for group II, all soybean HECT gene groups have orthologous genes in Arabidopsis thaliana. This is consistent with the results of a recent plant HECT study [22], which indicated that Arabidopsis thaliana HECT group II (UPL8 in their study) was absent. Members of each group usually have identical gene structures and domain architectures, which suggests that they may interact with identical or similar substrates.
Segmental duplication, tandem duplication, and transposition events are the three principal evolutionary patterns of gene duplication that cause gene family expansion [30,[35][36][37]. Of these, segmental duplication events happen most frequently in plants, because most plants are diploidized polyploids and retain numerous duplicated chromosomal blocks within their genomes [30]. In this study, we found that all soybean HECT genes are located in duplicated blocks, suggesting that segmental duplication contributed significantly to the expansion of the soybean HECT gene family. A previous study has shown that the soybean genome has undergone two genome duplication events, at 58 and 13 Mya [28]. By estimating the duplication date of the duplicated pairs of soybean HECT genes, we postulate that the paralogous genes in group II, IV, and VI originate from both the ancient and recent duplication event, while in group I, III, V, and VII they originate from the recent duplication event.
Analysis of the expression patterns of these soybean genes in 14 tissues showed that most HECT genes were relatively highly expressed in flowers and roots. However, Glyma06g00600 and Glyma17g01210 were highly expressed in the nodules. From this, we inferred that the highly expressed HECT genes in the flowers may be involved in the degradation of genes relating to flowering, via ubiquitination during soybean flowering stage. Additionally, the results suggested that the highly expressed genes in roots and nodules may directly or indirectly control the expression of nitrogen-fixing genes during symbiotic nitrogen fixation. Previous studies have revealed that Arabidopsis thaliana AT4G38600/UPL3 restricts the rounds of genome endoreduplication and cell branching that occur during trichome development [7], and AT4G12570/UPL5 regulates leaf senescence through the degradation of AT4G23810/WRKY53, a transcription factor that acts positively in leaf senescence [3]. In our analysis, the soybean genes orthologous to Arabidopsis thaliana AT4G38600/UPL3 are four paralogous genes in group VI. These four genes were all expressed in soybean, but display different expression patterns in different tissues. Except for Glyma06g00600, which is expressed relatively highly in nodules, the other three genes are relatively highly expressed in roots and flowers. This may be caused by mutations accumulated after the two segmental duplication events, especially the latest duplication events. The soybean genes orthologous to Arabidopsis thaliana AT4G12570/UPL5 are two paralogous genes in group III. Glyma17g01210 was also highly expressed in nodules, while Glyma7g39546 was not expressed. The differential expression of paralogous genes of the same group indicates that the HECT genes in soybean may have the same or similar function as their orthologues in Arabidopsis thaliana; however, they may have evolved functional differences.
A recent report showed that ubiquitin-proteasome system (UPS) dependent proteolysis of the two transcription factors, AT5G41315/GL3 and AT1G63650/EGL3, is mediated by AT4G38600/UPL3 [34]. GLABROUS 3 (GL3) and ENHANCER OF GLABROUS 3 (EGL3), which function as positive regulators of trichome development, interact with the N-terminal ARM domains of UPL3 via their C-terminal domains. Moreover, other recent studies have revealed that the highly conserved residues in the three-dimensional structures of the HECT domain are essential for the ubiquitylation of proteins [21,[25][26][27]. Our analysis of 365 plant HECT domains shows that many highly conserved residues are present, suggesting that these conserved residues still play key roles in structural maintenance, and are involved in plant ubiquitination processes. Further functional analysis of these genes would better our understanding of the functional roles of HECT genes in soybean and other plants.

Identification of HECT Genes in Soybean
The soybean genome database (release v1.1) was downloaded from Phytozome v9.1 [28]. The HMM profile of the HECT domain (PF00632) was obtained from Pfam [38,39]. To identify potential HECT genes in soybean, the HECT domain profile PF00632 was used as a query for searching the soybean genome database, using the HMMER3.1 [40,41] program, hmmsearch, with its default parameters (E-value < 10 −5 ). To obtain the complete soybean HECT genes, the HMMER search results were used as queries in searching the soybean genome database a second time, using the BlastP and tBlastN programs [42] with their default parameters (E-value < 10 −5 ). All hits were subsequently verified using the InterProScan program [43] to confirm the presence of the HECT domain. Finally, the Pfam [38,39], PROSITE [44], SMART [45], SUPERFAMLIY [46], PANTHER [47], and Gene3D [48] databases were used to manually determine the domain architecture of each HECT gene. Sequences with an incomplete HECT domain or fewer than 300 amino acids were excluded from the final sequence dataset. In addition, similar analyses of HECT genes were performed for the other 40 plant species in Phytozome v9.1.

Phylogenetic Analysis and Gene Structure
The retrieved protein sequences were aligned using MUSCLE [49] with its default parameters, and MAFFT [50,51] (L-INS-i strategy). The alignment was filtered for informative sites using trimal v1.4, with the option-gappyout [52]. ProtTest v3.4 [53] was used to estimate the most appropriate model of amino acid substitution using both Akaike information and Bayesian information criterion, which together suggested that the Jones-Taylor-Thornton and γ-distributed site rates (JTT + G) model was the best-fit model. The filtered alignment was then used in the phylogenetic analysis, which in turn utilized maximum likelihood (ML) methods implemented in PhyML3.0 [54]. The analysis included 4 rate substitution categories, the JTT substitution model, a BIONJ starting tree, and 100 bootstrap repetitions. The final alignment was carried out based on the HECT domain alone, using the MAFFT (G-INS-i strategy). The Neighbor-Joining (NJ) trees of full-length amino acids sequences were constructed using the MEGA6 package with 1000 bootstrap repetitions under the JTT substitution model. Phylogenetic trees were visualized and annotated using the Interactive Tree of Life v2 Web server [55] and EvolView [56]. The structures of the HECT genes were made using the Gene Structure Display Server [57], via comparisons of the coding sequences with their corresponding genomic sequences.

Chromosome Location and Duplication
Information about the chromosome location of the HECT genes was obtained from the Phytozome v9.1. Duplication patterns of the soybean HECT genes were inferred based on their locations in the soybean genome. Tandem duplicated genes were defined as adjacent homologous genes located on the same chromosome, and separated by no more than five genes in a 100-kb region [58]. Segmentally duplicated genes were defined as two genes located on duplicated chromosomal blocks [29]. To determine whether an HECT gene was involved in segmental duplication, the syntenic blocks of each HECT gene were searched for in the Plant Genome Duplication Database [31], and visualized using Circos-0.67 [59].

Calculation of Synonymous (Ks) and Nonsynonymous Substitution (Ka) to Date Duplication Events
Synonymous (Ks) and nonsynonymous substitution (Ka) rates were calculated according to methods used in previous studies [29,58]. First, MUSCLE v3.8.31 [49] (with default parameters) was used to construct pairwise alignments of the protein sequences of the duplicated gene pairs. The coding sequence alignments based on these amino acid sequence alignments were guided by trimal v1.4 [52], with the option-backtrans. Then, Ks and Ka were estimated using the CODEML program in PAML (Phylogenetic Analysis by Maximum Likelihood) v4.8 [60]. For each gene pair, the approximate date of the duplication event was calculated using the mean Ks values from T = Ks/2λ, in which the mean synonymous substitution rate ( λ ) for soybean is 6.1 × 10 −9 [61].

Logos of HECT Domains and Three-Dimensional Representations of Domain Alignment
Logos of the HECT domains were generated using WebLogo3 online [32] (using the default parameters). Three hundred and sixty-five HECT domain sequences with the downloaded HECT NEDD4 (neural precursor cell expressed developmentally down-regulated protein 4) domain structure (PDB ID: 4BBN, chain A) [21] were aligned using the VMD (Visual Molecular Dynamics) MultiSeq alignment [62,63] method (coloring method: Sequence Similarity BLOSUM 90).

Expression Analyses
RNA-Seq data were downloaded from SoySeq [33] and used to analyze the expression patterns of HECT genes in soybean. These transcript data were obtained from 14 tissues, including underground tissues (root and nodule), seed development (seed 10DAF, seed 14DAF, seed 21DAF, seed 25DAF, seed 28DAF, seed 35DAF, and seed 42DAF), and aerial tissues (leaf, flower, pod-shell 10DAF, pod shell 14DAF, and one cm pod). The expression data were normalized RPKM (reads per kilobase per million mapped reads), and the heatmap was drawn in R.