Genome-Wide Identification and Characterization of the Vacuolar H+-ATPase Subunit H Gene Family in Crop Plants

The vacuolar H+-ATPase (V-ATPase) plays many important roles in cell growth and in response to stresses in plants. The V-ATPase subunit H (VHA-H) is required to form a stable and active V-ATPase. Genome-wide analyses of VHA-H genes in crops contribute significantly to a systematic understanding of their functions. A total of 22 VHA-H genes were identified from 11 plants representing major crops including cotton, rice, millet, sorghum, rapeseed, maize, wheat, soybean, barley, potato, and beet. All of these VHA-H genes shared exon-intron structures similar to those of Arabidopsis thaliana. The C-terminal domain of VHA-H was shorter and more conserved than the N-terminal domain. The VHA-H gene was effectively used as a genetic marker to infer the phylogenetic relationships among plants, which were congruent with currently accepted taxonomic groupings. The VHA-H genes from six species of crops (Gossypium raimondii, Brassica napus, Glycine max, Solanum tuberosum, Triticum aestivum, and Zea mays) showed high gene structural diversity. This resulted from the gains and losses of introns. Seven VHA-H genes in six species of crops (Gossypium raimondii, Hordeum vulgare, Solanum tuberosum, Setaria italica, Triticum aestivum, and Zea mays) contained multiple transcript isoforms arising from alternative splicing. The study of cis-acting elements of gene promoters and RNA-seq gene expression patterns confirms the role of VHA-H genes as eco-enzymes. The gene structural diversity and proteomic diversity of VHA-H genes in our crop sampling facilitate understanding of their functional diversity, including stress responses and traits important for crop improvement.


Introduction
The large central vacuole is one of the most distinctive organelles that are essential for plant viability [1]. Central vacuoles function as reservoirs for ions and metabolites, which allow buffering of changes in nutrient and toxic components, including toxic metals and excessive salts that plants frequently encounter. More importantly, the vacuoles are necessary for plant growth and development. In fact, the expansion of a plant cell is achieved by osmotically driving water influx into the vacuole that, in combination with the cell wall, provides the driving force for cell growth and reversible vacuolar

Protein Sequence Alignments and Phylogenetic and Motif Analyses
Aligned protein sequences of the identified VHA-H genes revealed they were highly conserved, with the C-terminal domain being shorter and more conserved than the N-terminal domain ( Figure 1). The linker between the two domains in all of the identified VHA-H genes contained six amino acids in length. A total of 15 conserved motifs of the protein sequences encoded by the VHA-H genes were revealed using MEME. The amino acid compositions of these motifs are presented in Figure 2. These conserved motifs encoded by the VHA-H genes in most of the main crops (except for BnVHA-H3 and BnVHA-H5 of Brassica napus) were highly consistent in types, numbers, and distribution modes ( Figure 2). They shared the same arrangement pattern of motifs 1-14. However, BnVHA-H3 did not contain motif 7, which was replaced with motif 15 at this position, while BnVHA-H5 lacked motifs 6, 8, and 12. These motif arrangements of these two genes all occurred in the N-terminal domain ( Figure 2). Motif 11 consisted of 11 amino acids, including a linker region of six amino acids in length ( Figure 1, Figure 2). The E-values of motifs 1 and 2 were the highest among all of the 15 motifs and the C-terminal domain was composed mainly of both motifs 1 and 2, supporting again the C-terminal domain being more evolutionarily conserved than the N-terminal domain (Table 2, Figure 2). Overall, the positional changes of the motifs of all of the identified genes were consistent within the positions of both the N-terminal and the C-terminal domains ( Figure 3). sativa), while the numbers of exons were 11 (17 genes in 8 species), 12 (one gene in 4 species including Gossypium raimondii, Oryza sativa, Setaria italica, and Sorghum bicolor), or 13 (one gene in Brassica napus). Analysis of the protein sequences and their physical and chemical properties revealed that the lengths of the deduced polypeptides ranged from 448 (Glycine max) to 495 amino acids (Oryza sativa), and the PI values of proteins were mostly around 67 except for one rice gene (OsiVHA-H), which had the highest PI value of 9.07. The N-terminal domain is located between two green arrows. The C-terminal domain is indicated inside a pink frame. The linker region of six amino acids highlighted with the orange background is a portion of motif 11, which is inside a black frame. The N-terminal domain is located between two green arrows. The C-terminal domain is indicated inside a pink frame. The linker region of six amino acids highlighted with the orange background is a portion of motif 11, which is inside a black frame.
The phylogenetic tree showed that the 23 protein sequences were clustered into two clades ( Figure 3), one containing dicots and the other monocots (Poaceae). The dicots included Arabidopsis thaliana, Brassica napus, Gossypium raimondii, Glycine max, Solanum tuberosum, and Beta vulgaris, while the monocots contained Triticum aestivum, Hordeum vulgare, Oryza sativa, Sorghum bicolor, Zea mays, and Setaria italica. The clade of monocots was further divided into two clusters containing plants belonging to Pooideae (wheat, barley, and rice) and Panicoideae (sorghum, corn, and millet), respectively. Three genes of Triticum aestivum were revealed in one monophyletic group, while two genes of Zea mays and one gene of Sorghum bicolor formed another monophyletic group. In the clade of dicots, most species containing more than one VHA-H gene (Gossypium raimondii, Glycine max, and Solanum tuberosum) were revealed as monophyletic except for Brassica napus, which had six VHA-H genes with Arabidopsis embedded. The phylogenetic trees derived from Bayesian analysis (data now shown) showed largely congruent topologies to those revealed by the neighbor-joining tree.
These motif arrangements of these two genes all occurred in the N-terminal domain ( Figure 2). Motif 11 consisted of 11 amino acids, including a linker region of six amino acids in length (Figures 1,2). The E-values of motifs 1 and 2 were the highest among all of the 15 motifs and the C-terminal domain was composed mainly of both motifs 1 and 2, supporting again the C-terminal domain being more evolutionarily conserved than the N-terminal domain (Table 2, Figure 2). Overall, the positional changes of the motifs of all of the identified genes were consistent within the positions of both the Nterminal and the C-terminal domains ( Figure 3). The phylogenetic tree showed that the 23 protein sequences were clustered into two clades ( Figure  3), one containing dicots and the other monocots (Poaceae). The dicots included Arabidopsis thaliana, Brassica napus, Gossypium raimondii, Glycine max, Solanum tuberosum, and Beta vulgaris, while the monocots contained Triticum aestivum, Hordeum vulgare, Oryza sativa, Sorghum bicolor, Zea mays, and Setaria italica. The clade of monocots was further divided into two clusters containing plants belonging These results clearly showed that many alternative splicing sites were located on the N-terminal domain. The 22 identified VHA-H genes from 11 species of main crops ranged in size from 3301 bp (BnVHA-H3) to 8112 bp (BnVHA-H5), both corresponding to Brassica napus. The variations in size among different genes were mainly due to the number and length of introns. For example, the gene BnVHA-H5 was 4144 bp in length and the length of its transcript was 2075 bp, while the gene of BnVHA-H6 was 3313 bp in length and its transcript was 2049 bp in length. Both transcripts of the genes BnVHA-H5 and BnVHA-H6 were similar in length and they all had 10 introns, while the variation (813 bp) between the lengths of these two genes was due to intron length. The gene GrVHA-H1 was 4522 bp in length with 10 introns, while GrVHA-H2 was 5649 bp but with 11 introns and their transcripts were 1392 bp and 1446 bp in length, respectively. This showcases variation in the lengths of the genes caused by the number and lengths of introns. It is noteworthy that 5 of the 6 genes of Brassica napus shared similar lengths except for one (BnVHA-H5), which had almost twice the length of any of the other five genes from the same species, mainly due to one extremely long intron. An analysis of phases of introns revealed that each gene contained the three phases known to disrupt (phases 1 and 2) and not disrupt (phase 0) codons ( Figure 3). The largest proportion of intron phases of all the genes was phase 0 (60%), followed by phase 2 (34%) and phase 1 (6%). Both the Nterminal and the C-terminal domains in these 12 plants shared similar intron distribution patterns. For example, the phases of the first three introns in the N-terminal domain shared the same patterns (i.e., phase 0, phase 0, and phase 1) in all 12 species of plants, while the C-terminal domain ended with the  to Pooideae (wheat, barley, and rice) and Panicoideae (sorghum, corn, and millet), respectively. Three genes of Triticum aestivum were revealed in one monophyletic group, while two genes of Zea mays and one gene of Sorghum bicolor formed another monophyletic group. In the clade of dicots, most species containing more than one VHA-H gene (Gossypium raimondii, Glycine max, and Solanum tuberosum) were revealed as monophyletic except for Brassica napus, which had six VHA-H genes with Arabidopsis embedded. The phylogenetic trees derived from Bayesian analysis (data now shown) showed largely congruent topologies to those revealed by the neighbor-joining tree. to Pooideae (wheat, barley, and rice) and Panicoideae (sorghum, corn, and millet), respectively. Three genes of Triticum aestivum were revealed in one monophyletic group, while two genes of Zea mays and one gene of Sorghum bicolor formed another monophyletic group. In the clade of dicots, most species containing more than one VHA-H gene (Gossypium raimondii, Glycine max, and Solanum tuberosum) were revealed as monophyletic except for Brassica napus, which had six VHA-H genes with Arabidopsis embedded. The phylogenetic trees derived from Bayesian analysis (data now shown) showed largely congruent topologies to those revealed by the neighbor-joining tree. genes of Triticum aestivum were revealed in one monophyletic group, while two genes of Zea mays and one gene of Sorghum bicolor formed another monophyletic group. In the clade of dicots, most species containing more than one VHA-H gene (Gossypium raimondii, Glycine max, and Solanum tuberosum) were revealed as monophyletic except for Brassica napus, which had six VHA-H genes with Arabidopsis embedded. The phylogenetic trees derived from Bayesian analysis (data now shown) showed largely congruent topologies to those revealed by the neighbor-joining tree. genes of Triticum aestivum were revealed in one monophyletic group, while two genes of Zea mays and one gene of Sorghum bicolor formed another monophyletic group. In the clade of dicots, most species containing more than one VHA-H gene (Gossypium raimondii, Glycine max, and Solanum tuberosum) were revealed as monophyletic except for Brassica napus, which had six VHA-H genes with Arabidopsis embedded. The phylogenetic trees derived from Bayesian analysis (data now shown) showed largely congruent topologies to those revealed by the neighbor-joining tree. genes of Triticum aestivum were revealed in one monophyletic group, while two genes of Zea mays and one gene of Sorghum bicolor formed another monophyletic group. In the clade of dicots, most species containing more than one VHA-H gene (Gossypium raimondii, Glycine max, and Solanum tuberosum) were revealed as monophyletic except for Brassica napus, which had six VHA-H genes with Arabidopsis embedded. The phylogenetic trees derived from Bayesian analysis (data now shown) showed largely congruent topologies to those revealed by the neighbor-joining tree. genes of Triticum aestivum were revealed in one monophyletic group, while two genes of Zea mays and one gene of Sorghum bicolor formed another monophyletic group. In the clade of dicots, most species containing more than one VHA-H gene (Gossypium raimondii, Glycine max, and Solanum tuberosum) were revealed as monophyletic except for Brassica napus, which had six VHA-H genes with Arabidopsis embedded. The phylogenetic trees derived from Bayesian analysis (data now shown) showed largely congruent topologies to those revealed by the neighbor-joining tree. one gene of Sorghum bicolor formed another monophyletic group. In the clade of dicots, most species containing more than one VHA-H gene (Gossypium raimondii, Glycine max, and Solanum tuberosum) were revealed as monophyletic except for Brassica napus, which had six VHA-H genes with Arabidopsis embedded. The phylogenetic trees derived from Bayesian analysis (data now shown) showed largely congruent topologies to those revealed by the neighbor-joining tree. one gene of Sorghum bicolor formed another monophyletic group. In the clade of dicots, most species containing more than one VHA-H gene (Gossypium raimondii, Glycine max, and Solanum tuberosum) were revealed as monophyletic except for Brassica napus, which had six VHA-H genes with Arabidopsis embedded. The phylogenetic trees derived from Bayesian analysis (data now shown) showed largely congruent topologies to those revealed by the neighbor-joining tree. one gene of Sorghum bicolor formed another monophyletic group. In the clade of dicots, most species containing more than one VHA-H gene (Gossypium raimondii, Glycine max, and Solanum tuberosum) were revealed as monophyletic except for Brassica napus, which had six VHA-H genes with Arabidopsis embedded. The phylogenetic trees derived from Bayesian analysis (data now shown) showed largely congruent topologies to those revealed by the neighbor-joining tree. one gene of Sorghum bicolor formed another monophyletic group. In the clade of dicots, most species containing more than one VHA-H gene (Gossypium raimondii, Glycine max, and Solanum tuberosum) were revealed as monophyletic except for Brassica napus, which had six VHA-H genes with Arabidopsis embedded. The phylogenetic trees derived from Bayesian analysis (data now shown) showed largely congruent topologies to those revealed by the neighbor-joining tree. containing more than one VHA-H gene (Gossypium raimondii, Glycine max, and Solanum tuberosum) were revealed as monophyletic except for Brassica napus, which had six VHA-H genes with Arabidopsis embedded. The phylogenetic trees derived from Bayesian analysis (data now shown) showed largely congruent topologies to those revealed by the neighbor-joining tree. containing more than one VHA-H gene (Gossypium raimondii, Glycine max, and Solanum tuberosum) were revealed as monophyletic except for Brassica napus, which had six VHA-H genes with Arabidopsis embedded. The phylogenetic trees derived from Bayesian analysis (data now shown) showed largely congruent topologies to those revealed by the neighbor-joining tree. containing more than one VHA-H gene (Gossypium raimondii, Glycine max, and Solanum tuberosum) were revealed as monophyletic except for Brassica napus, which had six VHA-H genes with Arabidopsis embedded. The phylogenetic trees derived from Bayesian analysis (data now shown) showed largely congruent topologies to those revealed by the neighbor-joining tree.   -H5), both corresponding to Brassica napus. The variations in size among different genes were mainly due to the number and length of introns. For example, the gene BnVHA-H5 was 4144 bp in length and the length of its transcript was 2075 bp, while the gene of BnVHA-H6 was 3313 bp in length and its transcript was 2049 bp in length. Both transcripts of the genes BnVHA-H5 and BnVHA-H6 were similar in length and they all had 10 introns, while the variation (813 bp) between the lengths of these two genes was due to intron length. The gene GrVHA-H1 was 4522 bp in length with 10 introns, while GrVHA-H2 was 5649 bp but with 11 introns and their transcripts were 1392 bp and 1446 bp in length, respectively. This showcases variation in the lengths of the genes caused by the number and lengths of introns. It is noteworthy that 5 of the 6 genes of Brassica napus shared similar lengths except for one (BnVHA-H5), which had almost twice the length of any of the other five genes from the same species, mainly due to one extremely long intron.

Structures of the VHA-H Genes
An analysis of phases of introns revealed that each gene contained the three phases known to disrupt (phases 1 and 2) and not disrupt (phase 0) codons ( Figure 3). The largest proportion of intron phases of all the genes was phase 0 (60%), followed by phase 2 (34%) and phase 1 (6%). Both the N-terminal and the C-terminal domains in these 12 plants shared similar intron distribution patterns. For example, the phases of the first three introns in the N-terminal domain shared the same patterns (i.e., phase 0, phase 0, and phase 1) in all 12 species of plants, while the C-terminal domain ended with the sequence of phase 2, phase 0, phase 0, phase 0, phase 0, and phase 0, in that order. Furthermore, the intron distribution patterns of the different genes in the same species were also highly conserved.

Splice Variants of VHA-H Genes
Due to the existence of alternative splicing sites, a total of 7 genes in 6 species of crops contained multiple transcripts (Tables 3 and 4). These transcripts generated a total of 26 putative translation products making up 63.45% of the total (41) putative translation products inferred from the 22 identified VHA-H genes in the 11 crop species. The alternative splicing events occurred mainly in the N-terminal domain, which is the main functional region of the enzyme. There was a total of 51 splicing sites, 27 of which derived from exon skipping, 9 from alternative 5 splice sites, 6 from alternative 3 splice sites, and 8 from mutually exclusive exon events ( Table 3). The exon skipping events occurred on 5 out of the 7 genes containing multiple transcripts and on many exons. The alternative splicing sites occurring in the UTRs were identified in 6 out of the 7 genes and accounted for more than half of all splicing sites with some transcripts having only the alternative splicing in the UTR. On the one hand, more transcripts were revealed in species with fewer genes, e.g., barley had only one gene (HvVHA-H) but with six transcripts. On the other hand, species with more genes had fewer transcripts. For example, canola had 6 genes but with each gene having only one transcript. Furthermore, crops with wide planting areas had more than one gene and each had multiple splice variants. For example, maize had two genes (ZmVHA-H1 and ZmVHA-H2) with each gene having three and five transcripts, respectively. Table 3. Splice variants of 7 VHA-H genes identified in 6 species of crops. Gene names are the same as those in Table 1. Alternative splicing sites occurring in the 5 or 3 UTRs are indicated in parentheses following the spliced exons.

Gene Transcript Ensembl Transcript ID Predicted Amino Acid Length (aa) Spliced Exon Status
GrVHA-H2

Analysis of Cis-acting Elements of VHA-H Promoters
The cis-acting elements of the promoter regions of the 23 VHA-H genes were analyzed (Figure 4). A total of 16 major cis-acting elements were revealed and categorized into three groups. The first group was involved in development, including circadian and AuxRR-core. The second group contained widely distributed phytohormone regulators such as P-box, TCA-element, GARE-motif, TGA-element, TATC-box, TGACG-motif, CGTCA-motif, and ABRE. The third group contained environmental stress-related elements which were abundantly present in the promoter regions, mainly LTR, WUN-motif, GC-motif, ARE, MBS, and TC-rich repeats. Among them, the most abundant elements were ABRE and ARE, which respond to hormones and stress, respectively.

Splice Variants of VHA-H Genes
Due to the existence of alternative splicing sites, a total of 7 genes in 6 species of crops contained multiple transcripts (Tables 3,4). These transcripts generated a total of 26 putative translation products making up 63.45% of the total (41) putative translation products inferred from the 22 identified VHA-H genes in the 11 crop species. The alternative splicing events occurred mainly in the N-terminal domain, which is the main functional region of the enzyme. There was a total of 51 splicing sites, 27 of which derived from exon skipping, 9 from alternative 5′ splice sites, 6 from alternative 3′ splice sites, and 8 from mutually exclusive exon events ( Table 3). The exon skipping events occurred on 5 out of the 7 genes containing multiple transcripts and on many exons. The alternative splicing sites occurring in the UTRs were identified in 6 out of the 7 genes and accounted for more than half of all splicing sites with some transcripts having only the alternative splicing in the UTR. On the one hand, more transcripts were revealed in species with fewer genes, e.g., barley had only one gene (HvVHA-H) but with six transcripts. On the other hand, species with more genes had fewer transcripts. For example, canola had 6 genes but with each gene having only one transcript. Furthermore, crops with wide planting areas had more than one gene and each had multiple splice variants. For example, maize had two genes (ZmVHA-H1 and ZmVHA-H2) with each gene having three and five transcripts, respectively.

Analysis of Cis-acting Elements of VHA-H Promoters
The cis-acting elements of the promoter regions of the 23 VHA-H genes were analyzed (Figure 4). A total of 16 major cis-acting elements were revealed and categorized into three groups. The first group was involved in development, including circadian and AuxRR-core. The second group contained

Tissue-Specific Expression Patterns of VHA-H Genes
Since the RNA-seq data of the VHA-H genes in 11 crops were incomplete in the EnsemblPlants database, we only obtained eight genes from six crop species to conduct the expression analysis. Most of these genes were highly expressed in roots and leaves, with more expressions in roots than in leaves except for those in maize ( Figure 5). These VHA-H genes are evidently expressed higher in Arabidopsis and Setaria italica than in other species. TATC-box, TGACG-motif, CGTCA-motif, and ABRE. The third group contained environmental stressrelated elements which were abundantly present in the promoter regions, mainly LTR, WUN-motif, GC-motif, ARE, MBS, and TC-rich repeats. Among them, the most abundant elements were ABRE and ARE, which respond to hormones and stress, respectively.

Tissue-specific Expression Patterns of VHA-H Genes
Since the RNA-seq data of the VHA-H genes in 11 crops were incomplete in the EnsemblPlants database, we only obtained eight genes from six crop species to conduct the expression analysis. Most of these genes were highly expressed in roots and leaves, with more expressions in roots than in leaves except for those in maize ( Figure 5). These VHA-H genes are evidently expressed higher in Arabidopsis and Setaria italica than in other species.

Discussion
Plant V-ATPase is a primary active proton pump of the endomembrane system [3]. It has multiple functions as a 'house-keeping' and stress response enzyme [27]. The VHA-H is a small subunit connecting the V1 and Vo complexes of the V-ATPase that is essential to the activity and stability of V-ATPase [22]. However, the VHA-H lacks genetic identification information in major crops and the evolutionary relationships of this gene family have not been investigated. Here we used well established bioinformatics and phylogenetic analysis methodology to identify and characterize the VHA-H genes of 11 main crops to illustrate both the structural diversity of the identified VHA-H genes and the proteomic diversity of the putative transcripts. These results will further our understanding of stress responses for crop improvement.

Identification of V-ATPase Subunit H Genes in the Main Crops
A total of 22 VHA-H genes were identified in 11 major crops with varied numbers of genes ranging from 1 to 6 among different plant species (Table 1). It was suggested previously that the number of genes encoding each of the V-ATPase subunits generally varied among different species of plants, suggesting the species-specific functions such as the acidification in the cell vacuole [10]. Furthermore, the injection of abscisic acid (ABA) significantly increased the citric acid content, accompanied simultaneously by the evident induction of the transcription levels of multiple subunits of the V-

Discussion
Plant V-ATPase is a primary active proton pump of the endomembrane system [3]. It has multiple functions as a 'house-keeping' and stress response enzyme [27]. The VHA-H is a small subunit connecting the V 1 and V o complexes of the V-ATPase that is essential to the activity and stability of V-ATPase [22]. However, the VHA-H lacks genetic identification information in major crops and the evolutionary relationships of this gene family have not been investigated. Here we used well established bioinformatics and phylogenetic analysis methodology to identify and characterize the VHA-H genes of 11 main crops to illustrate both the structural diversity of the identified VHA-H genes and the proteomic diversity of the putative transcripts. These results will further our understanding of stress responses for crop improvement.

Identification of V-ATPase Subunit H Genes in the Main Crops
A total of 22 VHA-H genes were identified in 11 major crops with varied numbers of genes ranging from 1 to 6 among different plant species (Table 1). It was suggested previously that the number of genes encoding each of the V-ATPase subunits generally varied among different species of plants, suggesting the species-specific functions such as the acidification in the cell vacuole [10]. Furthermore, the injection of abscisic acid (ABA) significantly increased the citric acid content, accompanied simultaneously by the evident induction of the transcription levels of multiple subunits of the V-ATPase, including the subunit of H [10]. These results demonstrated that the VHA-H genes carry out specific functions in different species of plants. It is noteworthy that one VHA-H gene was identified in Oryza sativa of the Indica Group (Table 3) but no VHA-H gene was found in Japonica rice, indicating Indica rice had higher genetic diversity than Japonica rice. This observation agrees with previously reported results [28]. More importantly, the lengths of the ORFs, the amino acid sequences, the isoelectric points, and the number of exons of these 22 VHA-H genes were highly consistent among all species of plants (Table 1), indicating that members in the VHA-H gene family are evolutionarily highly conserved in the crop plants we surveyed. This conservation is presumably important and necessary for maintaining the specific functions of these genes across a large taxonomical spectrum.

Protein Sequence Alignments and Phylogenetic and Motif Analyses
Alignments of the amino acid sequences of the VHA-H genes showed that the sequence variations were in the middle portions of the N-terminal domain (Figure 1). These results are consistent with those of the gene structural analysis, which showed more variations in the N-terminal domain than the C-terminal domain (Figure 3). The conservative nature of the C-terminal domain was previously reported based on phylogenetic studies of the N-terminal and C-terminal domains of the V-ATPase proteolipid [29]. By using the N-terminal and C-terminal domains as separate entries for phylogenetic reconstruction, Gogarten [29] suggested that the front and back halves of the V-ATPase proteolipid were derived from a gene duplication that occurred after the bifurcation of the Sulfolobus acidocaldarius sequence and before the radiation of the eukaryotes. Furthermore, the results of the motif analysis showed clearly that the gains and losses of motifs all occurred in the N-terminal domain (Figure 2), suggesting again the more conservative nature of the C-terminal domain than the N-terminal domain of the V-ATPase. Our results revealed that among the 15 conserved motifs of the protein sequences putatively encoded by the VHA-H genes, motifs 1-14 were highly consistent in types, numbers, and distribution modes across main crops (except for BnVHA-H3 and BnVHA-H5 in Brassica napus), most notably sharing the same arrangement patterns. The E-values of motifs 1 and 2 were the highest among all of the 15 motifs. Since the C-terminal domain is mainly made up of both motifs 1 and 2, E-values again suggest that the C-terminal domain is more conserved than the N-terminal domain (Table 2, Figure 2). Overall, the observed positional changes of the motifs of all of the identified genes were consistent within the positions of both the N-terminal and the C-terminal domains (Figure 3). It is worth noting that the linker region of six amino acids connecting both the N-terminal and C-terminal domains was entirely included in motif 11, indicating that the upstream region of motif 11 belongs to the N-terminal domain (Figure 1, Figure 2).
The V-ATPase is an evolutionarily conserved and ancient enzyme with remarkably diverse eukaryotic functions [30]. Therefore, it has been speculated that phylogenetic analysis of the VHA-H gene family could help reconstruct evolutionary relationships among widely different organisms. In fact, the V-ATPase subunits have been used in phylogenetic analyses of various groups of organisms [9].
Our phylogenetic analysis of the VHA-H gene family showed the separation of the monocots and the dicots into distinct monophyletic groups harboring various clusters that were congruent with currently recognized taxonomic groupings (Figure 2). The phylogenetic relationships of gramineae also largely agree with those reported previously based on the evolutionary studies of the subunits of VHA-H genes in plants [9]. Furthermore, the evolutionary studies of the VHA-H genes suggested clear distinctions of various orders of insects (e.g., Diptera, Lepidoptera, and Orthoptera) [31]. Collectively, results suggest that the VHA-H gene is an evolutionarily ancient genetic marker that appeared before species diversification.

Gene Structural Diversity
The structural diversity generated by losses or gains of introns within gene families is one of many evolutionary mechanisms that promote variability [32]. However, intron positions and intron phases of the 12-oxo-phytodienoic acid reductase (OPR) genes were found to be well-conserved, with some introns being conserved in all plant lineages [33]. It has been speculated that lineage-specific intron loss events might have occurred during the expansion and structural evolution of OPR genes, which generated gene structural diversity [33,34]. Our results showed that the number and phases of introns were conserved in individual species while exhibiting inter-species diversity (Figure 3).
For example, identical numbers and phases were observed in the introns of three species of plants, Glycine max (GmVHA-H1 and GmVHA-H2), Zea mays (ZmVHA-H1 and ZmVHA-H2), and Triticum aestivum (TaVHA-H1, TaVHA-H2, and TaVHA-H3). However, the numbers and phases of introns were different among different species of crops. For example, the fourth intron was phase 1 in Zea mays, phase 0 in Triticum aestivum, and phase 2 in Arabidopsis (Figure 3). It has been suggested that the presence of introns in the same positions in orthologous genes from distant eukaryotes may reflect evolutionary conservation rather than parallel gain [35]. Our analysis of phases of introns showed that the first 3 introns of the N-terminal domain and the last 5 introns of the gene exon/intron structure were conserved in all 11 species of crops, which shared the same patterns of phases. In contrast, the introns in the middle portion of the N-terminal domain were diverse, showing mostly phases 1 and 2 in various orders ( Figure 3). Furthermore, introns with phase 0 were commonly located in more conserved areas, e.g., in the C-terminal domain (Figure 3). The C-terminal domain was more conserved and shorter than the N-terminal domain because many alternative splicing sites were located on the N-terminal domain (Figure 3). These results agree with those previously reported showing that introns with phase 0 were normally located in more conserved portions of genes than introns with phases 1 or 2 [36]. Previous studies demonstrated that conserved intron positions were found within a variety of ancient protein modules, suggesting that the initial function of exons did not represent the boundaries of functional protein modules, but that the domain itself was assembled from exons [37]. These results suggest that the phases of introns and the numbers of the VHA-H genes were well-conserved and that some introns (i.e., those located in the middle portion of the N-terminal domain) maintained their specificity.
Gene structural analysis also showed multiple VHA-H genes in some species, with individual genes showing varied gene structures. For example, both ends of the VHA-H gene in Brassica napus maintained equally conserved structures. However, the numbers and phases of the introns largely changed in the middle portion of the gene in comparison to other crops ( Figure 3) and its genetic structure was rather diverse in the N-terminal domain, when compared to the C-terminal domain (Table 1, Figure 3). Our results revealed the diversity of both exons and introns of the N-terminal domain. Since the N-terminal domain is the major functional domain of the VHA-H gene product, the diversity of the VHA-H gene structure revealed in its N-terminal domain suggest the existence of functional diversity. Previous studies showed that single intron loss and gain contributed to the diversification of gene structure, and consequently, functional diversity and divergence, during the evolution of the NAD(H) kinase genes in plants [32]. Results of our analysis of gene structures indicated that the changes in the number and position of exons mainly occurred in the N-terminal domain, while the C-terminal domain was left largely unchanged ( Figure 3). It was previously reported that the activation and functions of the V-ATPase complex do not require the first 179 amino acids of the N-terminal domain and the minimal region capable of activating the V-ATPase contains 174 amino acids at positions 180-353 of the N-terminal domain [25]. These results suggest that the changes in the N-terminal domain cause exon variations, leading to changes in functional specificity of the V-ATPase enzyme.

Splice Variants
Alternative splicing has been considered as one of the most important mechanisms contributing to the protein structural and functional diversity [36,38]. Alternative splicing is involved in many physiological processes of plants, including the response to biotic and abiotic stresses [39][40][41][42]. Specifically, the alternative splicing events were increased in Arabidopsis under salt stress probably due to an acclimation response [39]. In some cases, alternative splicing may alter the domain architecture of kinases, influence their subcellular localizations, and enhance the ability to cope with stress via transcriptomic plasticity [42]. Our studies showed that alternative splicing events occurred mainly in the N-terminal domain, which is the main functional region of the VHA-H gene product, showcasing higher variability relative to the C-terminal domain ( Table 2). We also showed that the VHA-H genes in crops of wide planting area or with more than one family member contained multiple splice variants (Table 4). For example, there were two VHA-H genes (ZmVHA-H1 and ZmVHA-H2) in Zea mays with each gene having three and five transcripts, respectively. Based on these results, we speculate that alternative splicing events occurring in the N-terminal domain, the functional unit of the enzymes, result in diverse functions needed to cope with varied environments.
Although the detailed functions of the alternative splicing variants in VHA-H genes have not been investigated in crops, studies in mouse and zebrafish suggested that two isoforms derived from gene ATP6V1H encoding the subunit H of V-ATPase were due to alternative splicing [16]. The amounts and the ratios of these two transcripts varied greatly among various types of tissues or cells suggesting that the selective expression of these two splice variants had different effects on the craniofacial development of zebrafish [16]. Our study of the 11 major crops showed that the transcripts of the VHA-H genes were diverse with some genes having multiple transcripts. For example, one VHA-H gene (HvVHA-H) of Hordeum vulgare had six transcripts, which is the largest number of transcripts uncovered in this study. It is also important to point out that more than half of the alternative splicing sites occurred in the UTRs ( Table 3). The UTR sequences are known to play crucial roles in the post-transcriptional regulation of gene expression, including modulation of mRNA transport out of the nucleus [43], translational efficiency, and subcellular localization and stability [44]. Thus, alternative splicing of the UTRs could play an important role in the transcription process in the major crops we investigated. Furthermore, only the 5 UTR but not the 3 UTR in maize contained a cleavage site, while intron retention occurred in the 5 UTR in the ZmVHA-H2.4 transcript, making the 5 UTR longer (Table 3). It has been reported that long and various 5 UTRs provide more and different regulatory elements that might influence the efficiency of transcription, translation, and even the functions of the protein products of a single gene [45]. In wheat, only the 3 UTR had alternative splicing sites, while alternative splicing occurred in both 3 UTR and 5 UTR in several other species (e.g., barley, millet, and potato, Table 3). It was previously demonstrated that both the 5 and 3 UTRs in pre-mRNAs play a variety of roles in controlling eukaryotic gene expression, including translational modulation [46]. For example, it was reported that both mRNA splicing and AU rich elements in the 3 UTR can inter-dependently influence β-catenin mRNA stability [47]. These results suggest that the diversity of transcripts in major crops may lead to protein diversity, consequently, promoting the functional diversity of the VHA-H genes.

Cis-Acting Elements of the VHA-H Promoters
As an ancient eco-enzyme, V-ATPase plays an important role in plant development and adaptation [8]. Our results showed that the promoter elements of the VHA-H genes contained mainly the development and adaptation-related elements that respond to hormones and stress, respectively (Figure 4), suggesting that the VHA-H promoters play important roles in regulating plant responses to hormones and environmental stress. In our study, many hormones were associated with the VHA-H promoter elements, including abscisic acid (ABA), auxin, gibberellin (GA), methyl jasmonate (MeJA), and salicylic acid (SA) (Figure 4). Studies have shown that salt tolerance can be improved by GA and SA regulation [48,49], while cold tolerance can be improved by MeJA and ABA [50]. Furthermore, the ABA-responsive element (ABRE), which is the major cis-acting element for ABA-responsive gene expression, was the most abundant element of the VHA-H promoter regions (Figure 4). Studies have suggested that the control of the expression of ABA signaling factors may improve tolerance to environmental stresses [51]. For example, it was reported that ABA was associated with salt stress [52]. Similarly, NaCl-induced salt stress resulted in a significant accumulation of ABA in root tissues [53]. In addition, Shim et al. [54] reported that the content of SA increased in rice seedlings stressed by NaCl treatment. These studies suggest that there is a strong correlation between hormone regulation and environmental stress. This is supported by our study, which shows that both type and number of VHA-H promoter elements dominate responses to hormones and stress (Figure 4).
These results suggest that the VHA-H promoters are associated with stress, suggesting that the VHA-H gene is both a housekeeping gene and a stress response (ecological) gene.

Tissue-Specific Expression Patterns of VHA-H Genes
Roots and leaves are generally important organs for plant growth and development. Studies showed that V-ATPase plays important roles in nutrient absorption and translocation and in the particular functions of guard cells [11,12], which are closely related to the functions of the roots and leaves. Our results showed that most of the VHA-H genes were expressed at high levels in roots and leaves ( Figure 5), indicating that VHA-H genes may be important for the functions of these organs. We speculate that Arabidopsis maintains the features of wild plants and Setaria italica shows strong adaptability in various drought and barren growing environments. These features may be attributable to the higher expression of VHA-H genes relative to other crop species ( Figure 5). Our results showed that the expression of ZmVHA-H and SiVHA-H genes in flowers was generally low, while high in HvVHA-H, SbVHA-H, and AtVHA-H ( Figure 5). Since the flowers are one of the most important organs affecting seed yield, we further speculate that the VHA-H genes may be associated with yield in Hordeum vulgare and Sorghum bicolor.

Identification of VHA-H Genes
The protein sequences of the VHA-H gene family present in Gossypium raimondii, Oryza sativa Indica Group, Brassica napus, Beta vulgaris, Glycine max, Hordeum vulgare, Solanum tuberosum, Setaria italica, Sorghum bicolor, Triticum aestivum, and Zea mays were obtained from the EnsemblPlants database release 41 (Available online: http://plants.ensembl.org/info/website/ftp/index.html). BLASTp was used to identify the putative proteins encoded by the VHA-H genes from the 11 main crops using the protein sequence encoded by the VHA-H gene of Arabidopsis thaliana (AT3G42050, retrieved from EnsemblPlants database (Available online: http://plants.ensembl.org/index.html)) as query. A local protein database was created using BioEdit [55]. The Pfam database (Available online: http://pfam.xfam.org/) [56] was used to verify the predicted protein sequences of VHA-H genes and their corresponding two domains, i.e., V-ATPase-H-N (PF03224) and V-ATPase-H-C (PF11698). The VHA-H genes were named according to Shi et al. [10]. Characterizations of identified members of the VHA-H gene family, including accession numbers, chromosomal locations, ORF lengths, and numbers of exons and introns were retrieved from the EnsemblPlants database with one exception, i.e., the information of the chromosomal locations of canola was obtained from the Brassica napus Genome Resources (Available online: http://www.genoscope.cns.fr/brassicanapus/). Genes with incomplete gene sequences and domains were removed from further analyses. Basic physical and chemical parameters, i.e., molecular weight (MW) and theoretical isoelectric point (PI) of putative proteins of VHA-H genes, were calculated by using the compute pI/MW tool of the Expert Protein Analysis System (Available online: ExPAsy, https://web.expasy.org/compute_pi/).

Phylogenetic and Protein Motif Analyses
Amino acid sequences encoded by 24 VHA-H genes from 12 species of plants and yeast were aligned using ClustalW in MEGA-X [57]. Phylogenetic analysis of these 24 amino acid sequences was conducted by the neighbor-joining method with 1,000 bootstrap replicates using MEGA-X and by Bayesian inference using MrBayes [58] with the ScVHA-H sequence of yeast (Saccharomyces cerevisiae) as the outgroup. Conserved protein motifs of the 24 proteins putatively encoded by the VHA-H genes were identified with MEME (Available online: http://meme-suite.org/tools/meme) based on the full-length protein sequences of each putative member of the VHA-H gene family.

Gene Structure of VHA-H Genes
The Gene Structure Display Server (GSDS, available online: http://gsds.cbi.pku.edu.cn) was used to analyze the exon-intron structures within the coding sequences and the genomic sequences of each predicted VHA-H gene derived from the EnsemblPlants databases [59]. To illustrate the evolutionary patterns of introns, the phases of introns were selected when the structures of VHA-H genes were visualized with the tools at GSDS.

Splice Variants of VHA-H Genes
The sequences of splice variants of the 11 crops were retrieved from EnsemblPlants. Analysis of splice variants was conducted as previously described [60]. Specifically, all alternative splicing sites were classified into five types (i.e., exon skipping, mutually exclusive exons, alternative donor site, alternative acceptor site, and intron retention) based on [61]. The transcripts showing similar gene structure to that of Arabidopsis thaliana was selected as the wild type (WT), which was used as a reference for other gene transcripts to determine the types of alternative splicing.

Analysis of cis-acting Elements of VHA-H Promoters
The 2000 bp regions upstream of the transcription start site of all VHA-H genes in Arabidopsis and 11 crop species were obtained from EnsemblPlants. Cis-acting elements present in these upstream regions were predicted with the PlantCARE tool that is available online (Available online: http: //bioinformatics.psb.ugent.be/webtools/plantcare/html/) [62,63].

Tissue-Specific Expression Patterns of VHA-H Genes
The RNA-seq data of VHA-H genes in flower, leaf, root, and shoot were obtained from EnsemblPlants. Gene expression abundances were visualized as heat maps.

Conclusions
We have identified 22 VHA-H genes in major crop plants. We have further studied the alignments and evolutionary relationships of the amino acid sequences, gene structures of exons and introns, and protein motif and alternative splice variants of these putative genes. We suggest the N-terminal domain is the major source of protein diversity and presumably the main functional region of species-specific adaptation, while the C-terminal domain is conserved and probably retains the original functions and characteristics of an ancient V-ATPase. The VHA-H gene family in plants shows genetic structure and transcript diversities mainly in the N-terminal domain, which presumably is the main source of functional diversity of these genes. Results of this study contribute further understanding of the structure, function, and evolution of the VHA-H genes and their important species-specific roles for crop adaptation and improvement.