Sequence Diversity and Identification of Novel Puroindoline and Grain Softness Protein Alleles in Elymus, Agropyron and Related Species

The puroindoline proteins, PINA and PINB, which are encoded by the Pina and Pinb genes located at the Ha locus on chromosome 5D of bread wheat, are considered to be the most important determinants of grain hardness. However, the recent identification of Pinb-2 genes on group 7 chromosomes has stressed the importance of considering the effects of related genes and proteins. Several species related to wheat (two diploid Agropyron spp., four tetraploid Elymus spp. and five hexaploid Elymus and Agropyron spp.) were therefore analyzed to identify novel variation in Pina, Pinb and Pinb-2 genes which could be exploited for the improvement of cultivated wheat. A novel sequence for the Pina gene was detected in Elymus burchan-buddae, Elymus dahuricus subsp. excelsus and Elymus nutans and novel PINB sequences in Elymus burchan-buddae, Elymus dahuricus subsp. excelsus, and Elymus nutans. A novel PINB-2 variant was also detected in Agropyron repens and Elymus repens. The encoded proteins detected all showed changes in the tryptophan-rich domain as well as changes in and/or deletions of basic and hydrophobic residues. In addition, two new AGP sequences were identified in Elymus nutans and Elymus wawawaiensis. The data presented therefore highlight the sequence diversity in this important gene family and the potential to exploit this diversity to modify grain texture and end-use quality in wheat.


Introduction
Grain texture has major effects on the end-use quality of bread wheat (T. aestivum L.) as it influences flour particle size distribution, starch damage, milling/air-classification properties and water absorptivity [1]. Flour from hard wheat is used for making bread whilst the flour from soft wheat is preferred for biscuits, cakes and pastries [2]. The major determinants of texture are the puroindoline proteins (PINs), PINA and PINB, which are encoded by Pin genes at the Ha (hardness) locus on chromosome 5D [3,4] and account for about 60-80% of the variation in hardness in crosses between bread wheat cultivars [5,6]. These two proteins are also the major components of "friabilin", a mixture of proteins present on the surface of starch granules prepared by the aqueous washing of flour from soft wheats but not from hard wheats [7]. PINs are being researched extensively due to their crucial and disruptive role in dictating grain hardness [8]. It has been suggested that friabilin/PINs influence grain texture by reducing adhesion between the surface of the starch granule and the surrounding protein matrix, so that less energy is required for milling and less starch damage results [9,10]. Recently, [11] showed that the introduction of puroindoline genes into durum wheat reduced milling energy and changed milling behavior, which became similar to that of soft common wheats. Equally, the silencing of the Pin genes led to an increase in grain hardness [12,13].
In addition to the Pina and Pinb genes, the Ha locus contains a third gene called Grain Softness Protein-1 (Gsp-1) [4]. This gene encodes a protein which is post-translationally cleaved to give two products. A short 15-residue peptide corresponding to the protein N-terminus is glycosylated and forms the core of the arabinogalactan peptide (AGP). The second encoded protein, called "grain softness protein" (GSP), is related to the PIN proteins and was initially considered to contribute to the control of hardness [14]. Although this role was disputed by [15], we have recently shown that down-regulation of the expression of Gsp-1 in transgenic plants resulted in a small but statistically significant increase in hardness [16]. It has also been shown that multiple genes related to Gsp-1 occur elsewhere in the wheat genome [14,17,18]. The Gsp-1 genes now appear to be ubiquitous in grasses, being identified in 65 species from five major grass subfamilies, with over 20 different Gsp-1 alleles being reported [19].
Several studies have reported sequence diversity in Pin genes of wheat and related grass species, aimed at identifying novel variations which could affect their functional properties, and hence be exploited to increase the range of textural characteristics in wheat breeding programmes [4-6,17,28-31]. We have also shown that species of the genera Elymus and Agropyron are particularly rich sources of sequence diversity in the Gsp-1 gene, particularly in the sequence encoding the AGP peptide [19]. We have therefore extended this study to survey sequence diversity in Pina, Pinb, Pinb-2v and Gsp-1 genes in a range of diploid, tetraploid and hexaploid species of Elymus and Agropyron as well as closely related genera (Pseudoroegneria, Psathyrostachys, Thinopyrum). The wheatgrass (Agropyron, Elymus, Thinopyrum, Pseudoroegneria) and wild ryegrass (Elymus and Psathyrostachys) species all share common genomic origins even though their taxonomic classification remains controversial. The genetic similarities between the genera Thinopyrum, Agropyron, Pseudoroegneria and hexaploid wheat have been clearly shown using the fluorescent in situ hybridization (FISH) technique [32] and have allowed hybridization between these species (including barley, durum wheat and rye) to introgress novel variations in breeding programmes [33-35].
Our study has identified novel variation in the PINA, PINB, PINB-2 variant and GSP-1 protein sequences of Elymus, Agropyron and closely related species, highlighting the diversity in this functionally important multigene family.

Results
The sequences of Pina, Pinb, Pinb-2v and Gsp-1 genes were determined for species of Elymus (eight species, eight accessions), Agropyron (three species, three accessions), Thinopyrum (three species, three accessions), Psathyrostachys (one species) and Pseudoroegneria (one species). Individual alignments of protein sequence types are shown in Figures 1-3, while a phylogenetic tree for all sequences analyzed is shown in Figure 4. The sequence results and genome relationships of these species are summarized in Table 1, while the full nucleotide and protein sequences for all genes are given in the supplementary data: Pina S2-S7; Pinb S8-S15; Pinb-2 variant S16-S19 and Gsp-1 S20-S23. All novel sequences were submitted to the European Nucleotide Archive.

Sequence Diversity in PINA Proteins
PCR screens using Pina-specific primers produced amplicons of approximately 450 base pairs from all Elymus and Agropyron species except for Elymus trachycaulus subsp. subsecundus, Elymus wawawaiensis and Thinopyrum elongatum. Two types of PINA sequence were detected in Elymus burchan-buddae, Elymus dahuricus subsp. excelsus and Elymus nutans. One was a novel form (GenBank accession number LT669797; sequence ID SHD75392), encoding a protein with 89% identity to PINA sequences from Aegilops kotschyi [29]. Alignment with wild-type PINA from T. aestivum cultivar Cadenza (PINA-D1a) (Figure 1) shows unique amino acid substitutions and deletions: Ser22Gly, a deletion of four residues at positions 29-32 and Asp33Gly in all three species, with the additional substitutions Asp33Lys, Gly35Val, Val41Leu, Ser57Thr, Arg131Lys, Asn139Asp and Pro141Arg in Elymus dahuricus subsp. excelsus. The second sequence type encoded a protein with 97-99% identity with published sequences from Elymus libanoticus [39], with unique Ser22Arg and Ser119Gly substitutions compared to wild-type protein sequences.
The PINA sequences from Elymus sibiricus, Agropyron cristatum, Agropyron mongolicum, Thinopyrum scripeum (100%), Thinopyrum bessarabicum (99%), Agropyron repens (98%) and Pseudoroegneria spicata (97%) showed high identity with the bread wheat sequence (accession number P33432; [40]). The sequence from Psathyrostachys juncea showed 99% identity to published sequences from Psathyrostachys juncea (accession number AER62828; [39]), with a glycine insertion at position 29 of the mature protein. The two Elymus angulutus sequences both had two stop codons, at positions 72 and 113 of the mature proteins, meaning that they are unlikely to be expressed. Examples of the sequences detected in the species are shown in an alignment in Figure 1. The effect of the shorter PINA sequences from Elymus burchan-buddae, Elymus dahuricus and Elymus nutans on the protein structure, especially the tryptophan-rich loop region which is characteristic of PINs, was investigated using the Phyre2 web portal for protein modelling, prediction and analysis [41] (Figure 5 and Supplementary Figure S1).
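The percentage identities quoted above depend on how aligned positions and gaps are counted. As an illustration only, the following minimal Python sketch computes identity over the columns of a pairwise alignment, counting a gap opposite a residue as a mismatch. The function name and the toy sequences are hypothetical and are not taken from the data reported here; database search tools may apply a different convention.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity between two rows of a pairwise alignment.

    Assumption: columns where both rows carry a gap are ignored, and a gap
    opposite a residue counts as a mismatch.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" and b == "-":
            continue  # gap-only column carries no information
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no comparable columns")
    return 100.0 * matches / compared


if __name__ == "__main__":
    # Hypothetical toy alignment with a two-residue deletion in the second row.
    print(percent_identity("MKALFLIG", "MKA--LIG"))
```

Under this convention an in-frame deletion, such as the four-residue deletion described above, lowers the identity value in the same way as four substitutions would.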

Sequence Diversity in PINB-2 Proteins
Pinb-2 gene sequences were not detected in all species. However, a new PINB-2 sequence type was present in both Agropyron repens and Elymus repens (GenBank accession number LT669798; sequence ID SHD75393). This form contains a Ser65Pro substitution, as observed in several barley species [47], and three residues (Asn-Leu-Glu) at positions 122-124 instead of the five residues (Lys-Gln-Ile-Gln-Arg) in the barley proteins. The Ser65Pro substitution was also present in sequences from E. angulutus and three sequences from Elymus repens, but these contained the Lys-Gln-Ile-Gln-Arg sequence rather than Asn-Leu-Glu (Figure 3). PINB-2v1 type sequences containing single amino acid substitutions were detected in E. angulutus, E. nutans, E. burchan-buddae and E. sibiricus (Figure 3). PINB-2v1 type sequences from E. angulutus, E. burchan-buddae and E. sibiricus had 97% sequence identity with published sequences for T. aestivum (accession number AFB35605) [26], while Elymus nutans sequences had 93% identity with PINB-2v2-2 (accession number AFB35607) [26]. A second sequence type found in E. nutans had 97% sequence identity with hordoindoline sequences from H. brachyantherum subsp. californicum [44], although it contained a stop codon at position 57 of the protein sequence.
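In-frame stop codons of the kind noted above can be located by translating the coding sequence and recording the codon positions at which a stop occurs before the final codon. The sketch below is a minimal illustration using the standard genetic code; the function names and the short test sequence are hypothetical and do not correspond to any sequence in this study.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds: str) -> str:
    """Translate a coding sequence in frame 1; unknown codons become 'X'."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return "".join(CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, usable, 3))


def premature_stops(cds: str) -> list[int]:
    """1-based codon positions of stop codons that precede the final codon."""
    protein = translate(cds)
    return [i + 1 for i, aa in enumerate(protein[:-1]) if aa == "*"]


if __name__ == "__main__":
    # Hypothetical open reading frame: Met-Trp-stop-Gly-stop.
    print(translate("ATGTGGTAAGGTTGA"))
    print(premature_stops("ATGTGGTAAGGTTGA"))
```

Note that positions returned by such a function are counted from the first codon supplied, so they would need to be offset by the signal peptide length to be expressed relative to the mature protein.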

Gsp-1 Sequences in Elymus Species
Two new AGP sequence motifs were detected in Elymus nutans and Elymus wawawaiensis, respectively (Supplementary Figure S22). One of these, in E. nutans (GenBank accession number LT669799; sequence ID 75394), was similar to the AGP7 and AGP8 motifs reported by [19], with 93% identity with Australopyrum retrofractum and Elymus libanoticus sequences [36]. This AGP motif differs from AGP7 and AGP8 in Tyr21Phe, Ala22Val and Ala34Gly amino acid substitutions, while the GSP-1 protein encoded by the same sequence had substitutions including Ser79Tyr, Phe81Val, Leu128Phe, Ile149Met and Phe153Asp, which have been reported before in GSP-1 protein sequences [19].
The AGP motif in E. wawawaiensis (GenBank accession number LT669800; sequence ID SHD75395) is similar to AGP13 [19], differing only in an Ala29Ser amino acid substitution, and has 96% identity with sequences in H. bogdanii (accession number ADH94955; Cenci et al., 2009, direct submission). Novel amino acid substitutions within the associated GSP-1 sequence included Glu88Lys and Leu106Pro, which have not been reported before [19]. The two new motifs are aligned with AGP8 and AGP13 in Supplementary Figure S26.

Discussion
For more than 20 years, gene mining of the Ha locus has provided valuable information on the diversity of the puroindoline genes and has identified variations in the sequences of the encoded proteins that have had an impact on their functionality, e.g., [4-6,17,28,49-52]. However, these studies have focused on the Pin gene family and on the progenitors of bread wheat and related wild grass species, e.g., [29-31,39,53].
This study characterized Pina, Pinb, Pinb-2 and Gsp-1 genes, focusing on diploid, tetraploid and hexaploid species of the genera Elymus and Agropyron, which are less closely related to wheat, but also including more closely related species. Pinb and Gsp-1 genes were identified in all sixteen species that were screened, Pina genes in thirteen of the sixteen and Pinb-2 genes in ten of the sixteen. Although the levels of sequence conservation were generally high, as reported by [19], some amino acid positions showed high levels of variation in all protein sequences. The differences between forms of PINA and PINB-2 proteins included in-frame deletions of amino acids, while some PINB proteins had changes within the tryptophan-rich loop which is suggested to be essential for their role in determining grain texture (possibly by direct interaction with the starch granule surface) [48].
A combination of infrared and Raman spectroscopy showed that PINA and PINB proteins have a secondary structure consisting of 30% α-helix and 30% β-sheet, with 40% unordered structure [54]. The PIN and GSP proteins belong to a large family of seed storage proteins called the "prolamin superfamily" [55] and have been predicted to have similar structures to other members of this family, namely the 2S storage albumins occurring in dicotyledonous seeds [56]. The InterPro database (release 66.0) identified all the proteins analyzed as part of this bifunctional inhibitor/lipid transfer protein/seed storage 2S albumin superfamily. A comparison with related proteins in this database suggests that residues 49-139 of the PINA, PINB and PINB-2 proteins form a bifunctional inhibitor/plant lipid transfer protein/seed storage helical domain (named the IPR016140 domain), and that PINB protein residues 50-59, 79-90, 92-101 and 125-140 form a cereal seed allergen/grain softness/trypsin and alpha-amylase inhibitor domain (named the IPR006106 domain). The AGP/GSP-1 proteins analyzed were also shown to have IPR016140 domains (i.e., residues 63-154) and, similarly to PINB proteins, have IPR006106 domains (i.e., residues 63-72, 92-103, 105-114 and 138-153). These domains represent a structural region consisting of four helices with a folded leaf topology, forming a right-handed superhelix. 1H and 15N NMR spectroscopy of finger millet (ragi) proteins belonging to this superfamily confirmed that they form this globular four-helix motif with a simple 'up-and-down' topology, including a short anti-parallel beta-sheet [57].
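The residue ranges given above make it straightforward to check whether a given substitution falls within either InterPro domain. The following Python sketch encodes the ranges quoted for PINB and reports the domains containing a residue position. It is an illustration only: the function name is hypothetical, and it assumes that substitution positions use the same residue numbering as the domain ranges.

```python
# Residue ranges for PINB as quoted in the text (inclusive, 1-based).
PINB_DOMAINS = {
    "IPR016140": [(49, 139)],
    "IPR006106": [(50, 59), (79, 90), (92, 101), (125, 140)],
}


def domains_at(position: int, domains: dict = PINB_DOMAINS) -> list[str]:
    """Return the names of the domains whose segments contain the position."""
    return [
        name
        for name, segments in domains.items()
        if any(start <= position <= end for start, end in segments)
    ]


if __name__ == "__main__":
    # Positions of some PINB substitutions discussed in this paper.
    for pos in (75, 79, 89, 107, 138):
        print(pos, domains_at(pos))
```

With these ranges, positions 79, 89 and 138 fall within both domains, whereas positions 75 and 107 fall within IPR016140 only.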
The PINA sequences from Elymus sibiricus, Agropyron repens, Agropyron cristatum, Agropyron mongolicum, Thinopyrum bessarabicum, Thinopyrum scripeum and Pseudoroegneria spicata all showed high sequence identity (97-100%) with the published sequence from bread wheat [30,40,45], with only eight out of 148 residues (5.4%) showing variation in more than one sequence analyzed (Figure 1). Figure 1 shows that, of these eight residues, four substitutions showed no change in amino acid type, i.e., Val42Leu and Val64Leu both being hydrophobic, Glu43Asp being acidic and Ser58Thr being polar in nature. The four remaining substitutions were changes from polar to basic (Ser22Arg), polar to hydrophobic (Ser120Gly), polar to acidic (Asn140Asp) and hydrophobic to polar (Gly143Ser). There is no variation in the signal peptide (residues 1-19) and all the sequences maintained their ten cysteine residues. High conservation of PINA sequences was also observed by [30], who suggested that this was consistent with a role in plant defense. However, although in vitro studies have suggested that Pin proteins may contribute to plant defense [26], it should be noted that the biological roles of Pins, GSP and AGP have not been established in planta, and hence discussions of structure-function relationships are speculative.
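The classification of substitutions by residue type, and the proportion of variable residues, can be reproduced with a short script. The sketch below is illustrative only: the function names are hypothetical, and the grouping of residues (in particular, placing glycine with the hydrophobic residues) is an assumption chosen to match the grouping used in this section; other classification schemes exist.

```python
import re

# Assumed residue grouping, chosen to match the classes used in the text.
RESIDUE_CLASS = {
    **dict.fromkeys(
        ["Ala", "Val", "Leu", "Ile", "Met", "Phe", "Trp", "Pro", "Gly"],
        "hydrophobic",
    ),
    **dict.fromkeys(["Ser", "Thr", "Asn", "Gln", "Tyr", "Cys"], "polar"),
    **dict.fromkeys(["Lys", "Arg", "His"], "basic"),
    **dict.fromkeys(["Asp", "Glu"], "acidic"),
}

SUBSTITUTION = re.compile(r"([A-Z][a-z]{2})(\d+)([A-Z][a-z]{2})")


def classify_substitution(code: str) -> tuple[int, str, str]:
    """Parse a code such as 'Ser22Arg' into (position, old class, new class)."""
    match = SUBSTITUTION.fullmatch(code)
    if match is None:
        raise ValueError(f"not a substitution code: {code!r}")
    old, position, new = match.group(1), int(match.group(2)), match.group(3)
    return position, RESIDUE_CLASS[old], RESIDUE_CLASS[new]


def percent_variable(n_variable: int, n_total: int) -> float:
    """Percentage of variable residues, rounded to one decimal place."""
    return round(100.0 * n_variable / n_total, 1)


if __name__ == "__main__":
    for code in ["Val42Leu", "Glu43Asp", "Ser22Arg", "Ser120Gly", "Gly143Ser"]:
        position, old_class, new_class = classify_substitution(code)
        kind = "conservative" if old_class == new_class else "non-conservative"
        print(f"{code}: {old_class} to {new_class} ({kind})")
    print(percent_variable(8, 148))
```

Applied to the eight variable PINA residues, this grouping gives four conservative and four non-conservative changes, and eight of 148 residues corresponds to 5.4%.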
A new PINA sequence type, with the deletion of four amino acids at positions 29-32 of the wild-type protein, was detected in Elymus burchan-buddae, Elymus dahuricus subsp. excelsus and Elymus nutans. The most closely related sequence is one reported for Aegilops speltoides (89% identity) [29], and the phylogenetic analysis (Figure 4) shows a clear separation of these sequences from other types of PINA. Three-dimensional model structures were generated using the Phyre2 web portal for protein modelling, prediction and analysis [41] to predict any putative effects of these deletions/shorter sequences and amino acid variations on overall protein structure. Figure 5a is the best-fit alignment of the wild-type PINA to those detected in the Elymus species, and shows changes to the overall protein structure and to the tryptophan loop. We can only speculate that these changes may be due to the amino acid changes detected in these sequences, shown in Supplementary Figure S1. [8] used similar software, i.e., I-TASSER, to generate 3D structures to predict changes to the PINA proteins from synthetic hexaploid wheat lines. The structures generated were similar to the ones shown in Figure 5, with the four characteristic alpha helices, and the tryptophan-rich domain (TRD) was also confirmed as likely to form a loop [8].
The PINB sequences were less conserved than the PINA sequences, with twenty-six of the 148 amino acid residues showing two or more amino acid substitutions (17.6%), some of which were specific to sequence type (Figure 2). Completely novel sequence types were identified which were 91% identical to sequences from Aegilops speltoides [45] (GenBank accession number LT669796; sequence ID SHD75391, LS991245 and LR025202), while sequences from several species with the H genome (shared with Hordeum species) (Agropyron repens, Elymus nutans, Elymus trachycaulus and Elymus wawawaiensis) were more like those of hordoindolines in the Hordeum species [33,43]. The sequences from the Elymus species included several novel variants of PINB sequences within the Elymus genus, and two hexaploid Elymus species (Elymus dahuricus subsp. excelsus and Elymus nutans) each had three distinct PINB sequence types.
Of the 26 residues showing substitutions, ten were conserved in that they did not result in a change in the type of amino acid (Trp27Gly, Val54Met, Met55Leu, Val66Met, Leu89Pro, Phe112Ile, Leu113Phe, Val120Ile, Leu124Ile, Arg108Lys). However, the Leu89Pro mutation is known to confer the hard phenotype (PinB-1c; Rahman et al., 1994). A change from a hydrophobic to a basic amino acid at position 73 (which is within the tryptophan loop region) occurred in sequences from Agropyron mongolicum, Elymus angulutus and Elymus sibiricus. Because of its position, this substitution could be expected to affect grain hardness, and we have indeed found the same mutation in the hard T. aestivum cultivar Mercia (authors' unpublished results). Other mutations resulted in substitutions of polar for hydrophobic residues (Gly26Ser, Trp27Ser, Gly36Ser, Gly75Ser, Gly107Ser, Phe121Tyr, Ala138Ser) and polar for basic residues (Glu43Asp, Lys46Asn, His79Gln, Lys87Gln). Domains IPR016140 and IPR006106 showed considerably more variation than in the PINA proteins (i.e., Gly75Ser, His79Gln, Lys87Gln, Leu89Pro, Gly107Ser, Phe121Tyr and Ala138Ser substitutions were observed). All PINB sequences contained ten cysteine residues except for one sequence from Psathyrostachys juncea, which contained a Cys86Try mutation (Figure 2): this could have a deleterious effect on protein stability as Cys86 is predicted to form a disulphide bond with Cys134 [26].
Pinb-2v1 genes occurred in more species than the other three types of Pinb-2v genes, being detected in eight of the 16 species. Although they all had 97% sequence identity with sequences detected in bread wheat [26], unique amino acid substitutions were observed in the individual species (Figure 3). Some of the substitutions also occurred in the tryptophan loop, for example, the Ser65Pro, Arg68Lys, Leu69Phe and Leu69Ile mutations (Domain IPR016140). Pinb-2v3 sequences were only found in Pseudoroegneria spicata, Thinopyrum bessarabicum and Thinopyrum elongatum, while Pinb-2v2 was only found in Pseudoroegneria spicata. However, a new PINB-2 variant form was detected in Agropyron repens and Elymus repens (GenBank accession number LT669798; sequence ID SHD75393), which contained the tripeptide motif Arg-Leu-Gly instead of the pentapeptide Arg-Gln-Ile-Gln-Arg at positions 122 to 124. This study has also identified two new AGP/GSP sequences from the Elymus species.
The phylogenetic analysis in Figure 4 shows a clear grouping of the different sequence types, with the Gsp-1 gene sequences being quite separate from the sequences of all the Pin genes. The results also clearly show that the Pinb-2 variant genes arose from the ancestral Pinb gene after the Pinb and Pina ancestral lines had separated.
In most cases, the number of different sequences detected in the species was directly related to ploidy levels. This was especially true for the Pinb gene, where two to three different sequences were detected in the tetraploid Elymus trachycaulus subsp. subsecundus, Elymus burchan-buddae and Elymus sibiricus, and in the hexaploid Elymus dahuricus subsp. excelsus and Elymus nutans. Two different Pina sequences were also detected for Elymus burchan-buddae, Elymus dahuricus subsp. excelsus and Elymus nutans. The results reported here therefore contribute new information on the sequence diversity of Pin and Gsp-1 genes in wild-grass species related to wheat, showing clear differences in the degree of sequence conservation between the encoded proteins, and within the individual protein sequences. Because all the species studied can be hybridized with wheat, it will be possible to exploit variations in traits resulting from this diversity in wheat improvement programmes. However, our ability to predict the biological significance of this diversity is currently limited by our lack of knowledge of the biological roles of the proteins in planta. In particular, the biological relevance of the effects on grain texture is unclear, with no obvious selective advantage, while other biological properties demonstrated in vitro (such as defense against pathogens and lipid binding) have failed to be confirmed in planta.

Figure 4. Phylogenetic tree of the nucleotide sequences of Pina, Pinb, Pinb-2v and Gsp-1 for all species screened, with known Triticum aestivum sequences. Separations between Gsp-1 and Pin genes are shown, with Pinb-2 variant genes arising from the ancestral Pinb gene, after Pinb and Pina separation. The branches were transformed equally, and branches ordered in increasing order. The species codes are those used in the Figure legends from the alignments shown in Figures 1-3.

Figure 5. Predicted 3D structures of PINAs from: (a) T. aestivum (wild-type); (b) Elymus burchan-buddae; (c) Elymus dahuricus; and (d) Elymus nutans using the Phyre2 web portal for protein modelling. The amino acid location of the shorter sequences (i.e., EGV) compared to the wild-type sequence (i.e., DVAGGGG), the tryptophan-rich domain (TRD) and the alpha helices are indicated. The region in purple always indicates a pi-helix. Regions in blue indicate a protein turn. A turn is a secondary-structure motif in which the Cα atoms of two residues separated by 1-5 peptide bonds are closer than normal (less than 7 Å [0.70 nm]), leading to the formation of an inter-main-chain hydrogen bond between the corresponding residues. The signal peptide (residues 1-28/24) has been removed as it would be in the post-translational processing.

Supplementary Materials:
The following are available online at http://www.mdpi.com/1424-2818/10/4/114/s1, Figure S1: Protein alignment of novel PINA sequences from Elymus species with wild type from T. aestivum (bread wheat). Black boxes indicate main amino acid motif differences between the sequences. Green boxes indicate changes within the same amino acid type (i.e., hydrophobic or hydrophilic); Figure S2: Nucleotide alignment of Pina sequences for Elymus angulutus, Elymus nutans, Elymus burchan-buddae, Elymus sibiricus and Elymus dahuricus. Number refers to clone number; Figure S3: Protein alignment of PINA sequences for Elymus angulutus, Elymus