2.1. Identification and Classification of R. serbica LEAPs
Previously, we performed transcriptomic analysis of
R. serbica hydrated leaves (HL) gene expression under regular watering conditions [
46]. Since our aim was to identify and characterise desiccation-induced late embryogenesis abundant (LEA) genes in
R. serbica leaves, we improved our database and expanded it to include desiccated leaves (DL). The completed
R. serbica de novo transcriptome database is available at:
https://zenodo.org/record/6341873#.YijgJ_7MJPY, accessed on 22 March 2022 (10.5281/zenodo.6341873) and translated into amino acid sequences at:
https://zenodo.org/record/6340979#.YiitWP7MJPY, accessed on 22 March 2022 (10.5281/zenodo.6340979). The sequence data from this article can be found in the Short Read Archive database at NCBI under accession numbers SRR18015613 and SRR18015612 (BioProject accession no. PRJNA806723 and sample accession no. SAMN25859880). An overview of the data production quality, length distribution, and the numbers of transcripts, unigenes, and annotated unigenes is given in
Supplementary Materials Table S1. In total, 49.1% of annotated sequences showed the best matches with
B. hygrometrica Bunge. R. Br. (homotypic synonym:
Dorcoceras hygrometricum) sequences (
Supplementary Figure S1).
A search of the NCBI NR protein database with the merged HL and DL transcripts, using the Basic Local Alignment Search Tool (BLAST), returned 433 members of the LEA gene family (
Supplementary Table S2). The obtained
R. serbica LEAP sequences were highly homologous with
Striga asiatica LEAPs, followed by
Capsicum annuum (
Supplementary Figure S2). Almost 20 hits were related to LEAPs identified in
D. hygrometricum. The final set of 318
R. serbica LEAPs was created by removing proteins consisting of fewer than 100 amino acids from the list of 359 LEAPs containing LEA domains (
Supplementary Table S2).
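The length filter described above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used in this study: the function name `filter_by_length`, the dictionary layout, and the toy sequences are all hypothetical; only the 100-residue threshold comes from the text.

```python
from typing import Dict

MIN_LENGTH = 100  # threshold stated in the text (amino acids)


def filter_by_length(proteins: Dict[str, str],
                     min_length: int = MIN_LENGTH) -> Dict[str, str]:
    """Keep only sequences with at least `min_length` residues."""
    return {name: seq for name, seq in proteins.items()
            if len(seq) >= min_length}


# Hypothetical toy input; identifiers and sequences are invented.
toy = {
    "toy_long": "M" + "AEK" * 40,   # 121 residues, kept
    "toy_short": "M" + "AEK" * 10,  # 31 residues, removed
}
kept = filter_by_length(toy)
```

Applied to the 359 domain-containing LEAPs, a filter of this kind would leave the proteins of 100 residues or more, which the text reports as the final set of 318.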
According to the annotated LEA domains, all identified LEAPs were grouped into seven protein family groups, ranging from LEA1 to LEA5, dehydrins, and seed maturation proteins (SMPs), as adopted by Reference [
14] (
Supplementary Table S2). The most populated
R. serbica LEA protein family group was LEA2, containing 127 proteins (almost 40% of the total identified LEAPs), followed by LEA4, which encompassed 96 proteins (~30%), while the smallest group, LEA5, included 11 proteins (
Table 1).
The phylogenetic analysis revealed that proteins belonging to the same LEA protein family group were phylogenetically related, with at least one clade with a common node. The exception was twenty-five LEA4 protein family group members that belonged to separate, independent clades (
Supplementary Figure S3). These proteins were evolutionarily the most distant from the LEA2 proteins, as indicated by their positions on opposite sides of the unrooted tree. In total, one hundred closely related gene pairs/paralogues were observed across all LEA groups.
To determine the homology of
R. serbica LEA protein family groups with those well-annotated in
A. thaliana [
14] and in upland cotton [
15], multiple sequence alignment (MSA) within the respective LEA protein family groups was performed (
Table 1). Phylogenetic analyses indicated the highest sequence homology between
R. serbica and
A. thaliana LEAPs within the LEA5 protein family group (~60%) (
Table 1). Regarding the sequence similarities between
G. hirsutum and
R. serbica LEAPs, the highest value, almost 34% homology, was detected for dehydrins.
2.2. Physicochemical Analysis of R. serbica LEAPs
The physicochemical characteristics (sequence length, pI, amino acid composition, molecular weight, and grand average of hydropathy, GRAVY) for all
R. serbica LEAPs are tabulated in
Supplementary Table S2. The
R. serbica LEA proteins had variable amino acid sequence lengths of up to 444 aa (LEA4 protein group), corresponding to a molecular weight of 44.9 kDa. The average sequence length of the
R. serbica LEA2 family members was the highest (~226 aa), followed by LEA4 (~187 aa), while LEA5 proteins were the shortest (118 aa) (
Table 2). Members of the LEA4 protein family group were observed to exhibit the most variable sequence lengths and molecular weights (
Supplementary Materials Table S2).
The LEA2 and LEA1 protein family group members were the most basic, with the average pI = 8.2–8.4, while the SMPs were mostly acidic, with an average pI value of 4.9 (
Table 2). The GRAVY index values were negative for all
R. serbica LEA proteins, except for some members of the LEA2 protein family group, although the average GRAVY value for the most hydrophobic group was −0.09 (
Table 2). The calculated GRAVY indices indicated that
R. serbica dehydrins were the most hydrophilic, showing the most negative GRAVY index, followed by LEA5 and LEA4 (
Figure 1).
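For reference, the GRAVY index of a sequence is the sum of the Kyte-Doolittle hydropathy values of its residues divided by the sequence length, so negative values indicate a hydrophilic protein. The following is a minimal Python sketch under the assumption that the standard Kyte-Doolittle scale is used; the function name `gravy` and the toy peptides are illustrative and do not reproduce the tool used in this study.

```python
# Kyte-Doolittle hydropathy scale for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence: str) -> float:
    """Grand average of hydropathy: mean Kyte-Doolittle value per residue.

    Raises KeyError for non-standard residue codes and ValueError for
    an empty sequence.
    """
    seq = sequence.upper()
    if not seq:
        raise ValueError("empty sequence")
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


# Invented toy peptides: a Lys/Glu-rich stretch is strongly hydrophilic,
# an Ile/Val/Leu-rich stretch is strongly hydrophobic.
hydrophilic = gravy("KKEEKKEE")
hydrophobic = gravy("IVLIVL")
```

A charged, Lys/Glu-rich stretch gives a strongly negative value, whereas a stretch of aliphatic residues gives a positive one.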
The average amino acid composition of each
R. serbica LEA protein family group is presented in
Table 2 and
Supplementary Table S2. The percentage of cysteine was generally low in the identified LEAPs. It was the highest within the LEA2 protein group members and one half of dehydrins and lowest in proteins belonging to the LEA1 and LEA4 groups (
Figure 1 and
Supplementary Table S2). The charged amino acid content was the highest in dehydrins, followed by LEA4 proteins (
Table 2). Accordingly, dehydrins and LEA4 family members contained the highest content of Lys (13–17%), Glu (up to 18%), and Asp (up to 10%) compared with the other LEA protein groups. The contents of aliphatic residues (Ile, Val, and Leu) were the highest in the LEA2 protein family group members, followed by SMPs in the case of Val. An exceptionally high content of alanine residues (up to 20%) was found in the LEA4, SMP, and LEA1 protein group members. Proline was the most abundant in dehydrins and LEA3 proteins. The histidine percentage was the greatest in dehydrins, followed by LEA1 protein family members (
Figure 1). Among
R. serbica LEAPs, the glycine content was the highest in DEH1, LEA1, and LEA5 protein family groups. The tryptophan content was generally low among
R. serbica LEAPs; it was the greatest in proteins belonging to the LEA3, LEA2 and LEA4 protein families (~1% of the total sequence length).
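Per-residue percentages of the kind reported above can be computed directly from a sequence. The sketch below is illustrative only: the function names are hypothetical, and it assumes that "charged" residues are Asp, Glu, Lys, and Arg (histidine is excluded), which may differ from the definition used for Table 2.

```python
from collections import Counter
from typing import Dict

CHARGED = "DEKR"  # assumption: His is not counted as charged


def composition_percent(sequence: str) -> Dict[str, float]:
    """Percentage of each residue type present in the sequence."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("empty sequence")
    counts = Counter(seq)
    return {aa: 100.0 * n / len(seq) for aa, n in counts.items()}


def charged_percent(sequence: str) -> float:
    """Percentage of residues that are Asp, Glu, Lys, or Arg."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("empty sequence")
    return 100.0 * sum(seq.count(aa) for aa in CHARGED) / len(seq)


# Invented toy peptide, not a sequence from this study.
toy = "AAKE"
toy_composition = composition_percent(toy)
toy_charged = charged_percent(toy)
```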
2.3. Homology Motifs Analyses of R. serbica LEAPs
To gain more information regarding structural diversity and conserved motif divergence of LEAPs from
R. serbica within seven distinctive LEA protein family groups, a domain architecture analysis was performed.
Supplementary Figures S4–S10 present conserved motif composition analysis of each LEAP (318) from
R. serbica within the specific LEA protein family group. To simplify the presentation and stress particular differences among the groups, the representatives with unique motif patterns were selected and are presented in
Figure 2.
The LEA1 protein family group includes LEAPs similar in length (
Supplementary Table S2) that mostly contained (80%) two highly conserved motifs: M1.1 and M1.2 (
Supplementary Figure S4) (
Figure 2). Both motifs were recognised as “LEA1” protein family domains by the Pfam database. Members of this protein family group could be clustered into three subgroups: LEA1.1 (M1.1 and M1.2 motifs), LEA1.2 (M1.1 motif), and LEA1.3 (M1.2 motif). The first motif, M1.1, contained 50 aa, with three conserved Lys residues present in almost all LEA1 members and seven highly conserved alternating Glu/Asp residues (
Table 3). In addition, 20 of the 50 residues in the M1.1 consensus sequence were charged, and its GRAVY index indicated that the M1.1 motif was very hydrophilic (
Table 3). The second motif, M1.2, encompassed 21 aa, including two conserved Lys, two Glu, one Gly, and seven Ala residues.
According to the homologous motifs, the LEA2 family was clustered into five major subgroups: LEA2.1, LEA2.2, LEA2.3, LEA2.4, and LEA2.5. Nine motifs were identified in this LEA group (
Supplementary Materials Figure S5 and
Table 3). Motifs M2.2, M2.3, M2.4, M2.5, and M2.8 contained “LEA2” protein family domains, according to the Pfam database. Subgroup LEA2.2 contained motifs M2.1, M2.2, and M2.6, while the “extended” subgroup LEA2.1 contained two additional motifs: M2.3 and M2.5. Additionally, all LEAPs belonging to subgroup LEA2.3 encompassed the M2.4 motif, while members of two clusters within this subgroup contained additional motifs M2.6 or M2.7 (
Figure 2). Motifs M2.6 and M2.7 with dominant nonpolar residues were the most hydrophobic among all the motifs detected in
R. serbica LEAPs (
Table 3). Motif M2.9 was a determinant motif for the subgroup LEA2.4, although some members of this subgroup contained motif M2.6 as well (
Figure 2). Proteins within the subgroup LEA2.5 were distinguished from the other LEA2 protein members by the presence of the M2.8 motif (
Supplementary Figure S5).
The LEA3 protein family group was clustered into two subgroups: LEA3.1 and LEA3.2 (
Figure 2). With the exception of RsLEA_42, four highly conserved motifs (M3.1, M3.2, M3.3, and M3.5) were found in the LEA3.1 group (
Supplementary Materials Figure S6). Interestingly, motifs M3.1 and M3.2 were rich in proline and glycine residues and contained almost ten completely preserved charged amino acids (
Table 3). In addition, motif M3.1 contained a conserved Trp residue. Motif M3.3 (29 aa) was rich in Ser, Arg, and aliphatic amino acids, similar to motif M3.2, while in the short motif M3.5 (6 aa), Arg, His, and Val were the dominant residues. Three LEA3.2 proteins contained a single motif, M3.4, recognised as the “LEA3” protein domain, according to the Pfam database (
Table 3).
Four distinctive motifs were identified in the LEA4 protein family group (
Supplementary Figure S7 and
Table 3). All members of the LEA4 protein family group contained the M4.1 motif, rich in charged amino acid residues. Subgroup LEA4.1 members contained the most polar
R. serbica LEA motif M4.3 (GRAVY index = −1.96), LEA4.2 contained motif M4.2, and all other LEA4 protein members were nested into LEA4.3 (
Figure 2 and
Supplementary Figure S7). Except for M4.2, all motifs identified in the LEA4 protein family group were very polar and rich in charged amino acid residues (>50%), namely lysine (20–33%) (
Table 3). Indeed, almost a quarter of the
R. serbica LEA4 protein group members contained at least one of the motifs from the KYS and Lys-rich motif classification system (
Supplementary Table S4). Based on the Pfam database, “LEA” protein domains were found in motifs M4.1 and M4.4.
Only 11 LEAPs formed the
R. serbica LEA5 protein family group (
Supplementary Table S3 and
Supplementary Figure S8), among which nine LEAPs encompassed the highly conserved motif M5.1 (
Figure 2 and
Table 3). In eight LEAPs of this group, motif M5.2 was found with strongly preserved glycine and charged residues (
Table 3). Pertinent to that, the GRAVY index of these two motifs (almost −1.2,
Table 3) indicated their high polarity.
Based on the motif homology, dehydrins were clustered into two subgroups: DEH1 and DEH2, which contained four distinct polar motifs (M6.1–M6.4) (
Figure 2). Members of the DEH1 subgroup were determined by the motif M6.3, rich in glycine, proline, and tyrosine residues and negatively charged amino acids (
Figure 2 and
Table 3). This motif contained the commonly called Y-segment DEYGNP (
Table 3). Almost 84% of the
R. serbica dehydrins contained motif M6.1, encompassing the commonly called K-segment: KKG[_N][MF]M[DE]KIKEK (
Table 3 and
Table S4). The highly conserved motif M6.2 was predominantly composed of eight Ser (the so-called S-segment), six Gly, three Pro, and eight charged residues (
Table 3). In
R. serbica dehydrins, the prevalent S-segment was SGSSSSSSS (namely in the DEH1 protein subgroup), although the S7, S8, and TGSSSSSS motifs were detected as well (
Supplementary Materials Table S4). Conserved motif M6.4 contained mainly charged amino acids, Pro, and Gly, similar to other motifs in this family group. Motifs M6.1, M6.2, and M6.4 encompassed the “dehydrin” protein family domain, as indicated by the Pfam database. Taken together, all dehydrins identified in
R. serbica contained at least one dehydrin-determining segment (
Supplementary Table S4).
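The dehydrin-determining segments named above lend themselves to a simple pattern search. The following Python sketch is a hedged illustration rather than the method used in this study: the regular expressions are assumptions derived from the segment consensus strings quoted in the text (the "_" in the K-segment is interpreted here as an optional position), and the function name and toy sequence are hypothetical.

```python
import re
from typing import Dict, List

# Assumed patterns, transcribed from the consensus strings in the text.
SEGMENT_PATTERNS = {
    # KKG[_N][MF]M[DE]KIKEK, with "_" read as "residue may be absent"
    "K": re.compile(r"KKGN?[MF]M[DE]KIKEK"),
    # Y-segment as quoted: DEYGNP
    "Y": re.compile(r"DEYGNP"),
    # SGSSSSSSS or TGSSSSSS, or an uninterrupted run of 7+ Ser (S7, S8)
    "S": re.compile(r"[ST]GS{6,}|S{7,}"),
}


def find_segments(sequence: str) -> Dict[str, List[str]]:
    """Return all non-overlapping matches of each segment pattern."""
    seq = sequence.upper()
    return {name: pattern.findall(seq)
            for name, pattern in SEGMENT_PATTERNS.items()}


# Invented toy sequence containing one Y-, one S-, and one K-segment.
s_segment = "SG" + "S" * 7
toy = "M" + "DEYGNP" + "AAA" + s_segment + "AA" + "KKGMMDKIKEK" + "LPG"
hits = find_segments(toy)
```

A protein would then count as containing at least one dehydrin-determining segment if any of the three lists is non-empty.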
Seed maturation proteins (SMPs) were clustered into three subgroups: SMP1, SMP2, and SMP3, depending on the presence and absence of two detected motifs: M7.1 and M7.2 (
Figure 2 and
Supplementary Materials Figure S10). Motif M7.1 involved seven fully preserved alanine, three valine, and four glycine residues, as well as five negatively charged residues, in all proteins. The shorter motif M7.2 encompassed mostly aliphatic (namely, Ala and Val) residues, leading to an almost positive GRAVY index. Both motifs were recognised as “SMP” protein family domains by the Pfam database.
2.4. Structure and Disorder Prediction of R. serbica LEAPs
R. serbica LEAPs significantly differ in their secondary structure, disorder propensity, and aggregation potential between distinct LEA protein family groups (
Supplementary Table S3). Five secondary structure predictors showed that more than 30% of the identified
R. serbica LEAPs exhibited a high propensity to form α-helices (>70% of the total sequence length), while almost 35% of all identified
R. serbica LEAPs showed the potential to form β-sheets in at least 30% of their sequence length. Almost 25% of LEAPs found in
R. serbica leaves exhibited a propensity to organise at least 50% of their sequence in the form of a random coil.
Particularly, the LEA4 protein family group exhibited a high propensity to form α-helices (in the range 71–97% of the sequence length). On average, only ~1% of the
R. serbica LEA4 family members sequence was predicted to form β-sheets (
Figure 3). In addition, a very low propensity for adopting a β-sheet conformation (up to 5% of the sequence length) was also exhibited by dehydrins and by the majority of the LEA1 protein family. On the contrary, the LEA2 family group, particularly the LEA2.3 subgroup, showed a high potential to form β-sheets and a low propensity for α-helices (
Supplementary Table S3). A positive correlation between the percentage of the sequence predicted to adopt a random coil and the sequence length was observed among the LEA2 protein family subgroups. For example, members of the LEA2.4 subgroup, with an average length of 298 aa, exhibited a propensity to adopt a random coil conformation for 58% of the sequence length. The prevalent conformation observed in the members of the dehydrin (particularly, the DEH1 subgroup, 76% of the total sequence length), LEA3 (63%), and SMP (51%) family groups was random coil (
Supplementary Table S3).
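Percentages such as those above can be derived from a per-residue secondary structure assignment. The sketch below assumes a three-state string (H for α-helix, E for β-strand, C for coil) as produced by many predictors, and it averages the fractions over several predictions. Both the string encoding and the simple averaging are assumptions made for illustration; the text does not state how the five predictors were combined.

```python
from statistics import mean
from typing import Dict, List

STATES = "HEC"  # assumed encoding: helix, strand, coil


def state_percent(assignment: str) -> Dict[str, float]:
    """Percentage of residues assigned to each secondary structure state."""
    if not assignment:
        raise ValueError("empty assignment")
    return {s: 100.0 * assignment.count(s) / len(assignment) for s in STATES}


def mean_state_percent(assignments: List[str]) -> Dict[str, float]:
    """Average the per-state percentages over several predictions."""
    per_predictor = [state_percent(a) for a in assignments]
    return {s: mean(p[s] for p in per_predictor) for s in STATES}


# Invented toy predictions for one 10-residue sequence.
toy_single = state_percent("HHHHEECCCC")
toy_mean = mean_state_percent(["HHHHEECCCC", "HHHHHHCCCC"])
```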
To get more information regarding α-helices within
R. serbica LEAPs, the structural properties of the obtained protein motifs (
Table 3) were analysed. Motif M1.1 tended to form a charged α-helix, with distinctive positive and negative faces, while, in M1.2, a hydrophobic face was also proposed (
Figure 4). Motifs M2.5 and M2.8 exhibited a low tendency to adopt amphipathic α-helical structures. In the
R. serbica LEA3 protein family group, the only motifs predicted to form an α-helical structure were M3.2 and M3.4, but no hydrophobic face was modelled. All four motifs in the LEA4 protein family group were predicted to be organised as α-helices (
Figure 4). According to the HeliQuest results, they all, except the M4.2 motif, contained negatively charged faces, while motifs M4.1, M4.2, and M4.4 exhibited hydrophobic faces as well (so-called A type of the α-helix). On the contrary, motifs M5.1 and M5.2 showed a lower tendency to form α-helices with no hydrophobic faces (
Figure 4). Only two motifs characteristic for
R. serbica dehydrins, M6.1 and M6.4, were predicted to form α-helices, while, in the motifs M6.2 and M6.3, the dominant conformation for more than 94% of the total sequence length was the random coil. In addition, both motifs identified in the SMP family group were predicted to form α-helices. Moreover, M7.2 tended to form a hydrophobic face (
Figure 4).
Surprisingly, despite a low propensity for folding into α-helical conformation, the presence of at least one transmembrane α-helix (TMH) within the
R. serbica LEA2 protein family was predicted both by TMHMM and FELLS predictors (
Supplementary Table S3). For example, almost all LEAPs belonging to the subgroups LEA2.3–5 were predicted to form at least one TMH composed of approximately 20 amino acids, while, in only two protein members of both the LEA2.1 and LEA2.2 groups, a single distinctive TMH was observed. In addition, in seven LEA2.3 group protein members, an additional TMH (two in total) was observed. In total, 32 different hydrophobic TMH domains were identified in 87 TMH-containing proteins belonging to the LEA2 protein family group (
Supplementary Table S5 and
Supplementary Figure S11). On the other hand, members of the SMP, dehydrin, LEA1, LEA3, and LEA5 protein family groups were predicted to be soluble—no transmembrane domains were predicted (
Supplementary Table S3).
Besides these three elements of protein secondary conformation, we analysed the disorder propensity of the identified
R. serbica LEAPs. As predicted by several bioinformatic tools, more than 55% of the identified LEAPs were found to be disordered (>50% of the sequence length) (
Supplementary Table S3). Indeed, more than 92% of the
R. serbica LEAPs (with the exception of the LEA2 protein group) exhibited a propensity to be disordered.
Comparisons between seven
R. serbica LEA protein family groups showed that, on average, dehydrins (particularly, DEH1 members) and LEA1 exhibited the highest propensity for disorder (87–97% of the total sequence length), followed by LEA4 (80–83% of the total sequence length) and LEA5 (79%) (
Supplementary Table S3). On the contrary, members of the LEA2 protein family group showed the highest hydrophobic effect and the lowest disorder propensity (22% of the sequence length), except in the case of the LEA2.4 subgroup, where the disorder propensity was twice as high.
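The disorder percentages reported here are fractions of the sequence length predicted to be disordered. As a minimal sketch, assuming that a predictor returns one score per residue and that residues scoring at or above 0.5 are called disordered (a common convention, but an assumption here), the percentage and the ">50% of the sequence length" criterion can be expressed as follows. The function names and the toy scores are hypothetical.

```python
from typing import Sequence

SCORE_CUTOFF = 0.5        # assumed per-residue disorder threshold
PROTEIN_CUTOFF = 50.0     # ">50% of the sequence length", as in the text


def disorder_percent(scores: Sequence[float],
                     cutoff: float = SCORE_CUTOFF) -> float:
    """Percentage of residues whose score is at or above the cutoff."""
    if not scores:
        raise ValueError("no scores given")
    disordered = sum(1 for s in scores if s >= cutoff)
    return 100.0 * disordered / len(scores)


def is_disordered_protein(scores: Sequence[float]) -> bool:
    """True if more than half of the residues are predicted disordered."""
    return disorder_percent(scores) > PROTEIN_CUTOFF


# Invented per-residue scores for two toy proteins.
mostly_disordered = [0.9, 0.8, 0.2, 0.7]
mostly_ordered = [0.1, 0.2, 0.6, 0.3]
```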
These findings were positively correlated with the predicted number and size of the globular domains (
Supplementary Table S3). All the
R. serbica LEA2 family members were predicted to form a single globular domain, occupying between 94 and 96% of the sequence length in the case of all LEA2 groups, except the LEA2.4 protein subgroup. On the contrary, no globular domain was predicted among all dehydrin, LEA1, and LEA4.1 protein members. Within the LEA4.2 protein subgroup, 11 of the 35, and within the LEA4.3 subgroup, 7 of the 47 members were predicted to fold into a single globular domain. Almost 83% of the LEA3 protein family members were predicted to fold into a single globular domain, while 35% of
R. serbica SMPs were predicted to be organised into one or two globular domains.
Information derived from representative structural models is key to understanding the function of LEAPs and the regulation of their intrinsic structural disorder-to-order transition during desiccation. Therefore, to integrate all the structural findings and predictions, we constructed 3D models, together with their prediction quality estimates, for representatives of the seven LEA protein family groups (
Figure 5 and
Supplementary Figure S12).
Consistent with the predictions presented above, the model of the RsLEA_86 protein, a member of the LEA1 protein group, contained two distinct α-helices encompassing the M1.2 and M1.1 motifs at the N-terminus and a random coil at the C-terminus. Sixteen members of the LEA2.1 subgroup were represented by the RsLEA_55 protein, containing motifs M2.1, M2.2, M2.3, M2.5, and M2.6 organised in two successive β-barrel domains at the C-terminus and an N-terminal random coil (
Figure 5). For all members of the LEA2.3, LEA2.4, and LEA2.5 protein family subgroups, a hydrophobic TMH followed by a globular β-barrel structural domain was shown on the example of RsLEA_211 (
Figure 5). The difference in the structures of the proteins belonging to the mentioned subgroups was related to the N-terminal random coil, whose length varied in relation to the whole protein sequence length. In addition, the LEA2.2 protein subgroup members, represented by RsLEA_275, also folded into a β-barrel structural domain at the C-terminus and N-terminal α-helix, composed of 20 residues, similar to the shorter members of the LEA2.3–2.5 subgroups (
Figure 5). In contrast to these proteins, in the RsLEA_275 protein, this α-helix was amphipathic, composed of a hydrophobic face and more polar residues, resulting in a net charge of +3, due to the presence of four lysine, one arginine, one glutamate, and one aspartate residue.
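The net charge quoted above follows from a simple residue count: (4 Lys + 1 Arg) − (1 Glu + 1 Asp) = 5 − 2 = +3. A minimal Python sketch of this count is given below. It assumes that only Lys and Arg contribute +1 and only Asp and Glu contribute −1, ignoring histidine and the chain termini, and the toy helix is an invented placeholder with the stated residue counts, not the actual RsLEA_275 sequence.

```python
def net_charge(sequence: str) -> int:
    """Integer net charge: (Lys + Arg) minus (Asp + Glu).

    Simplification: His and the N-/C-termini are ignored.
    """
    seq = sequence.upper()
    positive = seq.count("K") + seq.count("R")
    negative = seq.count("D") + seq.count("E")
    return positive - negative


# Invented 20-residue placeholder with four Lys, one Arg, one Glu,
# and one Asp; the remaining positions are filled with Ala.
toy_helix = "KKKKR" + "ED" + "A" * 13
toy_net_charge = net_charge(toy_helix)
```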
Besides the LEA1 and LEA2 protein family groups, a good correlation between the presented results and constructed 3D models was also obtained for the dehydrins, SMPs, LEA4, and LEA5 groups (
Figure 5). High disorder and random coil propensities were characteristic of dehydrins, as evidenced by the 3D model of the representative RsLEA_139 and by its higher predicted alignment error (PAE) values (
Supplementary Figure S12). Structural differences within the
R. serbica SMPs were illustrated by two representatives, a shorter RsLEA_66 containing only the M7.1 motif, compactly folded into one globular domain composed of all three secondary structure elements and a longer RsLEA_71 containing both the M7.1 and M7.2 motifs, and the N-terminal random coil. The exceptionally high propensity for folding into an α-helical conformation, particularly an A-type α-helix (HeliQuest, data not shown), was demonstrated for the
R. serbica LEA4 protein members, e.g., RsLEA_188 and RsLEA_301 (
Figure 5).
An almost equal distribution of α-helices and coils, with a very low percentage of β-sheets and the absence of a globular domain, was confirmed for the LEA5 family members represented by the RsLEA_202 protein. As a representative of the
R. serbica LEA3 protein family group, RsLEA_80 mostly folded into a random coil and showed a high PAE value, implying a significant disorder propensity (
Figure 5).