Identification and Characterization of Glycine- and Arginine-Rich Motifs in Proteins by a Novel GAR Motif Finder Program

Abstract: Glycine- and arginine-rich (GAR) motifs with different combinations of RG/RGG repeats are present in many proteins. The nucleolar rRNA 2′-O-methyltransferase fibrillarin (FBL) contains a conserved long N-terminal GAR domain with more than 10 RGG plus RG repeats separated by specific amino acids, mostly phenylalanines. We developed a GAR motif finder (GMF) program based on the features of the GAR domain of FBL. The G(0,3)-X(0,1)-R-G(1,2)-X(0,5)-G(0,2)-X(0,1)-R-G(1,2) pattern allows the accommodation of extra-long GAR motifs with continuous RG/RGG interrupted by polyglycine or other amino acids. The program has a graphic interface and can easily output the results as .csv and .txt files. We used GMF to show the characteristics of the long GAR domains in FBL and two other nucleolar proteins, nucleolin and GAR1. GMF analyses can illustrate the similarities and differences between the long GAR domains in the three nucleolar proteins and the motifs in other typical RG/RGG-repeat-containing proteins, specifically the FET family members FUS, EWS, and TAF15, in position, motif length, RG/RGG number, and amino acid composition. We also used GMF to analyze the human proteome and focused on the proteins with at least 10 RGG plus RG repeats. We showed the classification of the long GAR motifs and their putative correlation with protein/RNA interactions and liquid–liquid phase separation. The GMF algorithm can facilitate further systematic analyses of the GAR motifs in proteins and proteomes.


Introduction
Eukaryotic protein-coding genes may acquire novel features to accommodate the encoded proteins in the enormously large and complicated cellular environment. Fibrillarin (FBL), an rRNA 2′-O-methyltransferase (MTase), has a long N-terminal GAR domain with diverse combinations of arginine-glycine-glycine (RGG) or arginine-glycine (RG) repeats in eukaryotes but not in archaea [1]. On the other hand, the C-terminal MTase domain of FBL is highly conserved in archaea and eukaryotes. FBL is abundant in nucleoli and uses box C/D small nucleolar RNAs (snoRNAs) as guides to recognize the target sites in rRNA [2]. A nucleolar localization signal appears to be present in the GAR domain of human FBL [3]. Arginine methylation in the GAR domain can regulate the nuclear and nucleolar localization of FBL [1]. Thus, long GAR domains in eukaryotic FBL might be at least partially correlated with novel sub-compartmentation requirements in eukaryotic cells. Due to its critical roles in ribosome formation and protein synthesis, FBL is involved in tumorigenesis and viral infections and can be considered a therapeutic target [4,5].
Membrane-less organelles (MLOs) or numerous RNP bodies organize and function through the liquid-liquid phase separation (LLPS) mechanism to coordinate thousands of simultaneous molecular reactions spatiotemporally [6]. Nucleoli, the most prominent [...] usually refer to extra-long RG/RGG-repeat-containing sequences of more than 50 amino acid residues as GAR domains and to combinations of RG/RGG repeats within a limited length as GAR motifs. We are interested in how the long GAR domain in FBL changed through evolution and in whether similar long GAR domains exist in other proteins. We would also like to characterize the GAR motifs and domains in different proteins. We thus developed a GAR MOTIF FINDER (GMF) program to identify the motifs in proteins and to demonstrate the characteristics of GAR motifs. Using GMF, we analyzed the GAR domains of FBL and two other nucleolar proteins as well as the GAR motifs in other typical known RGG/RG-containing proteins. We then analyzed the whole human proteome for all long GAR-motif-containing proteins.

Program Coding of GAR Motif Finder (GMF)
This program was written in Python, with the graphical user interface (GUI) constructed with tkinter and matplotlib integrated into the GUI to draw the charts. It is available for download at https://mega.nz/file/icYyiRzT#IKknEik5PojpaxoK23zMKEXRXIglAGKN9IUCeXqAjPY (accessed on 16 January 2023). The GMF program allows users to examine and locate the GAR motifs in target protein sequences. The pattern G(0,3)-X(0,1)-R-G(1,2)-G(0,3)-X(0,5)-G(0,3)-X(0,1)-R-G(1,2) is employed to find GAR motifs with multiple RG/RGGs that might be interrupted by long flexible G-rich tracts or some other amino acids. X indicates any amino acid residue. The numbers in the parentheses, separated by a comma, are the minimum and maximum numbers of times the residue can repeat at that position. The input sequences must be in FASTA format. Single or multiple sequences can be analyzed each time. The text results shown in the information window include the accession numbers (names) of the input sequences, the position of the motif in the sequence, the motif sequence (pattern), the numbers of RG and RGG repeats, the amino acids other than G or R (else) in the motifs, the percentage coverage of the GAR motif in the polypeptide, the G to R ratios, the percentages of G, R, and other amino acids in the motifs, and the complete entry sequences with the GAR motifs bracketed. If the input contains more than one sequence, the total statistics at the end of the text show the number of input sequences, the number of sequences with GAR motifs, the percentage of the input entries with GAR motifs, and the total numbers of RG, RGG, and non-GR amino acids in the motifs. The results can be exported as a .txt file or, as a table containing the key information, as a .csv file. In addition, the graph window can show a pie chart of the RG/RGG percentage as well as a bar graph of the amino acids other than R and G in all the GAR motifs of the input sequences.
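The pattern described above can be read as a regular expression. The following is a minimal sketch of that reading, not the GMF source code: it assumes that X maps to `.` (any character), that the linker-plus-RG/RGG unit may repeat so that a long domain is reported as a single motif, and that "RG" is counted only when it is not followed by another G. The names `GAR_PATTERN` and `find_gar_motifs` are hypothetical.

```python
import re

# Assumed regex reading of the published pattern ("." stands for X).
GAR_PATTERN = re.compile(
    r"G{0,3}.?RG{1,2}"                    # G(0,3)-X(0,1)-R-G(1,2)
    r"(?:G{0,3}.{0,5}G{0,3}.?RG{1,2})+"   # linker + R-G(1,2), one or more times
)


def find_gar_motifs(sequence):
    """Return one dict per non-overlapping match of the assumed pattern."""
    hits = []
    for match in GAR_PATTERN.finditer(sequence.upper()):
        motif = match.group()
        hits.append({
            "start": match.start() + 1,  # 1-based position of the first residue
            "end": match.end(),          # 1-based position of the last residue
            "motif": motif,
            "RGG": motif.count("RGG"),
            "RG": len(re.findall(r"RG(?!G)", motif)),  # RG not followed by G
        })
    return hits


# Toy sequence (not a real protein) with two RG/RGG elements.
print(find_gar_motifs("MKKERGGGGRGAAAAAAAAAA"))
```

Because the leading G(0,3)-X(0,1) part is matched greedily, the residue directly before the first arginine is included in the reported motif in this sketch. A sequence with only one RG or RGG element yields no hit, since the pattern needs at least two arginines.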

Analyses of the N-Terminal GAR Domain of Fibrillarin in Different Model Organisms
To develop an algorithm for analyzing GAR motifs, we first needed to characterize the critical parameters of the motif. The GAR domain at the N-terminus of FBL has been well studied and is an excellent example from which to extract the key information. We retrieved and analyzed the FBL sequences from evolutionarily distant model organisms: yeast (Saccharomyces cerevisiae), nematodes (Caenorhabditis elegans), and fruit flies (Drosophila melanogaster). We also analyzed FBL from five vertebrate species, including a bony fish (zebrafish, Danio rerio), an amphibian (clawed frog, Xenopus tropicalis), a reptile (green anole, Anolis carolinensis), and two mammals (mouse, Mus musculus; human, Homo sapiens), to examine conservation and variation within a specific lineage. We loosely defined the GAR domain as the region from the first RGG to the last RG(G) sequence before the conserved EPHR sequence. The alignment is shown in Supplementary Figure S1.
To show the key features of these domains, Table 1 summarizes the lengths of the GAR domain and of the full-length FBL, the G/R ratios, and the percentages of G (G%), R (R%), and non-GR amino acids (non-GR%) in the domains. The lengths of the domains vary from 72 to 107 amino acids in the FBL orthologues. Other amino acids might be directly before but are not directly after R in the domain. The G/R ratios vary from 2.47 (yeast) to 4.73 (fruit fly) and are mostly around 3. Thus, most of the Rs in the domain are followed by two Gs as RGG. All Rs in the GAR domains are followed by at least one G as RG₁₋₁₃, that is, R followed by 1 to 13 Gs. The R% within the domain is usually around 20%, while the G% varies greatly in different FBL orthologues. Fly FBL contains five RG₄ elements as well as RG₅, RG₆, RG₇, RG₁₀, and RG₁₂, resulting in the highest G% (77.2%). Table 1 also lists the numbers and percentages of the non-GR amino acids. The intervening amino acids differ between species, but phenylalanine (F) is the most frequently detected. The features listed in Table 1 reflect the variances and similarities of these domains.
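The composition features summarized in Table 1 are simple residue counts. A minimal sketch of how they could be computed for one domain is given below; the function name `gar_composition` and the rounding choices are assumptions made for illustration, and the input is a toy string rather than an FBL sequence.

```python
def gar_composition(domain):
    """Return length, G/R ratio, and G%, R%, non-GR% for one domain string."""
    domain = domain.upper()
    length = len(domain)
    if length == 0:
        raise ValueError("empty domain")
    g = domain.count("G")
    r = domain.count("R")
    other = length - g - r  # every residue that is neither G nor R
    return {
        "length": length,
        "G/R": round(g / r, 2) if r else None,  # undefined without any R
        "G%": round(100 * g / length, 1),
        "R%": round(100 * r / length, 1),
        "non-GR%": round(100 * other / length, 1),
    }


# Toy domain: three RGG elements and one F.
print(gar_composition("RGGFRGGRGG"))
```

A domain made only of RGG elements has a G/R ratio of exactly 2, so ratios above 2 point to additional Gs between the elements.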

Development of the GAR MOTIF FINDER Program for Analyses of Long GAR Domain in FBL
Although we analyzed the GAR domains of the FBL proteins manually in the previous section, repeating such analyses by hand on more target sequences is inefficient. We therefore developed an algorithm, GAR MOTIF FINDER (GMF), to search for motifs containing repetitive RG or RGG elements and conducted pilot analyses with the FBL sequences. After testing different GAR motif pattern combinations, especially to include most of the extra-long polyglycine (polyG) sequences and short non-RG segments in some GAR domains of FBL, we defined the GAR motif as G(0,3)-X(0,1)-R-G(1,2)-G(0,3)-X(0,5)-G(0,3)-X(0,1)-R-G(1,2). This pattern requires an identified motif to have at least two elements of either RG or RGG and allows polyG tracts (up to 13 consecutive Gs) between the arginine residues. It can also accommodate at most six other intervening amino acid residues between RG/RGG elements. The X(0,1)-R-G(1,2) module accounts for the multiple FRGG, SRGG, ARGG, DRGG, or PRGG motifs that frequently occur in the GAR domain of FBL. Figure 1 shows the interface window of GMF. Users can input the sequences to be analyzed singly or in batches by uploading .txt files containing the sequences in FASTA format. Text results are shown in the window or can be exported to the same folder as the sequence file. The results can also be exported as an Excel table (.csv file). The outputs, whether in text or in table form, include the features in Table 1 plus the sequence pattern of the motif, the position (numbers of the start and end residues) of the motif, the percentage of the motif in the full-length polypeptide, and the numbers of RG and RGG repeats, to provide more information for evaluating the motif.

Genes 2023, 14, x FOR PEER REVIEW

Figure 1. The interface of GMF. The red boxes mark the information window, which shows the events of any operation; the function button panel; the graphic window, which displays the pie or bar charts; and the check box used to select the data to be analyzed.
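The batch workflow described in this section (FASTA input, tabular .csv output with motif position and coverage) can be sketched as follows. This is an illustrative outline and not the GMF implementation: the regular expression is an assumed reading of the pattern with a repeatable linker unit, and the function names and the column names in the header row are hypothetical.

```python
import csv
import io
import re

# Assumed regex reading of the GAR pattern ("." stands for X).
GAR_PATTERN = re.compile(r"G{0,3}.?RG{1,2}(?:G{0,3}.{0,5}G{0,3}.?RG{1,2})+")


def read_fasta(handle):
    """Yield (name, sequence) pairs from an iterable of FASTA text lines."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        elif name is not None:
            chunks.append(line.upper())
    if name is not None:
        yield name, "".join(chunks)


def motifs_to_csv(fasta_handle, csv_handle):
    """Write one CSV row per motif: name, 1-based start/end, motif, coverage."""
    writer = csv.writer(csv_handle)
    writer.writerow(["name", "start", "end", "motif", "coverage_percent"])
    for name, seq in read_fasta(fasta_handle):
        for m in GAR_PATTERN.finditer(seq):
            coverage = round(100 * len(m.group()) / len(seq), 1)
            writer.writerow([name, m.start() + 1, m.end(), m.group(), coverage])


# Toy input held in memory; real use would pass open file objects instead.
toy_fasta = io.StringIO(">toy1 demo\nMKKERGGGGRG\nAAAAAAAAAA\n>toy2\nMAAAAA\n")
result = io.StringIO()
motifs_to_csv(toy_fasta, result)
print(result.getvalue())
```

Sequences without any match, such as the second toy entry, simply contribute no rows to the table.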

Analyses of the GAR Domains in the Three Nucleolar Proteins by GMF
In addition to FBL, two nucleolar proteins, NCL and GAR1, also have long GAR domains of about 50 amino acids and are classified as tri-RGG proteins by Thandapani et al. [8]. To show the power of the program to analyze multiple sequences, we retrieved the sequences of the three nucleolar proteins FBL, NCL, and GAR1 from five different vertebrate model organisms (zebrafish, clawed frog, green anole, mouse, and human) to characterize their GAR domains by GMF. Features of the GAR domains in these proteins (vFNG) from the GMF analyses are shown in Table 2 (direct .csv output by GMF in Supplementary Table S1 and .txt output in Supplementary File S1). Minor differences in the FBL features between Tables 1 and 2 should be due to slightly different domain definitions. The GAR domains in the vFNG proteins generally are long, with more than 10 RGG plus 0-4 RG elements. The GAR domains of FBL are longer than those of NCL or GAR1 and contain the highest repeat numbers of RGG plus RG. The percentages of the motifs are about 20-24% for FBL, 6-8% for NCL, and 22-28% for the N-terminal as well as the C-terminal GAR domains of GAR1, reflecting the differences in the full lengths of these proteins. Of the 15 vFNG entries, only the GAR domain in zebrafish FBL is separated into two GAR motifs (residues 7-38 and 50-79) by the defined pattern of GMF. As the identified motifs are bracketed in the full-length sequence at the end of the entry in the text report, the interrupting G-rich sequence FGGGFKSPGGE between the two motifs can be easily identified. GMF identified one more short motif, ERGGGGRG, in zebrafish NCL, about 90 amino acid residues upstream of the long C-terminal GAR domain. There is only one RGG element at similar positions in other NCL sequences; thus, it would not be indicated by GMF.
As shown in Table 2, the G/R ratios in these GAR domains range from 2.12 to 4.38 and are mostly over 2.5, indicating that there usually are Gs between the RGG/RG repeats. The non-GR% are generally below 20%. The GAR domains of NCL orthologues have the highest purity of "RGG" as they consist of 9-12 RGGs but no or only one RG. Furthermore, the GAR domains of NCL have high G and R percentages, and F is almost the only intervening amino acid. The two GAR domains of GAR1 account for about 50% of this short polypeptide. It is interesting that from fish to amphibians, reptiles, and mammals, the N-terminal GAR (GAR-N) domains become longer and more diverse than the C-terminal ones. On the contrary, the C-terminal GAR domains (GAR-C) become shorter and contain only RG₁₋₆ repeats interrupted by F. Among all GAR domains of vFNG, the GAR-N domains of GAR1 in green anole, mouse, and human, the three amniote species, have the highest percentage of non-GR amino acids (~23-24%, mostly F, N, P, and S), while the GAR-C domains contain the lowest (7-8%).
The majority of the Rs in these GAR domains of vFNG occur as RGG rather than RG (in total 210 RGG/38 RG). Besides showing the identities and numbers of the non-GR amino acids of each GAR motif, GMF can sum the distribution of non-GR amino acids in all GAR motifs identified in the search and output it as a bar graph. We input vFBL, vNCL, and vGAR1 separately, and the overall non-GR amino acid distributions are shown in Figure 2A. F is the most abundant one in these domains. FBL orthologues have the longest GAR domains (with 19 RG/62 RGG and 27 F) and the most diverse composition, with 10 other interspersed amino acids in total. The GAR domains of NCL basically are composed of only G, R, and F (with 5 RG/51 RGG and 22 F). The two GAR domains of GAR1 orthologues (with 14 RG/97 RGG and 49 F) have eight non-GR amino acids. The degree of variation in the length, the RGG/RG distribution, and the percentages of G, R, and non-GR amino acids of the GAR domains among the orthologues of any FNG protein in vertebrate species largely correlates with their evolutionary distances.
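Summing the non-GR amino acids over all identified motifs, as described above for the bar graphs, amounts to a residue tally. A minimal sketch using the standard library is shown below; the function name `non_gr_distribution` is hypothetical, and the motif strings are toy examples rather than GMF output.

```python
from collections import Counter


def non_gr_distribution(motifs):
    """Count every residue other than G and R across a list of motif strings."""
    counts = Counter()
    for motif in motifs:
        counts.update(res for res in motif.upper() if res not in "GR")
    return counts


# Toy motifs: F is the most frequent intervening residue here.
distribution = non_gr_distribution(["FRGGFRGG", "SRGGDRG"])
print(distribution.most_common())
```

The resulting counts could be passed directly to a plotting routine such as a matplotlib bar chart, which is how a graph like Figure 2A might be drawn.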

Analyses of Other High RG/RGG-Repeat-Containing Proteins by GMF
As shown above, all three nucleolar proteins contain more RGG than RG repeats and have multiple Fs. We were interested in whether GAR motifs in other proteins also share these features. We used GMF to analyze 17 human proteins in the list of RG/RGG-repeat-containing (hRG/RGG) proteins of a previous study [25] and compared the results with those of the three nucleolar proteins.
Interestingly, though multiple RG or RGG elements are present in hnRNPA2/B1, they are discontinuous and cannot be defined as GAR motifs by GMF. GMF analyses of the other 16 proteins are shown in Table 3 (direct .csv and .txt output by GMF in Supplementary Table S2 and Supplementary File S2). In total, 43 GAR motifs are identified in these proteins. There might be more than one GAR motif in one protein, and the motifs might be scattered throughout the polypeptides. Most motifs in some proteins, such as the FET family members (FUS, EWS, and TAF15) and Lsm14a (RAP55A), are RGG-rich, while motifs in other proteins, such as KHDR1 (Sam68), caprin-1, and kmt2b, are RG-rich. Other RNA-binding proteins in the list, such as SERBP1, hnRNPA1, hnRNPU, DDX4, G3BP1, FMR1, and FXR1, contain motifs with RG and RGG at similar levels. Moreover, different motifs in any single protein might be RG- or RGG-rich. For example, the first GAR motif in SERBP1 has 6 RGs/2 RGGs, but the second one contains 2 RGs/4 RGGs. The total RG/RGG numbers of all GAR motifs in these 16 proteins are 112/115. Of the 16 proteins, 9 were classified as tri-RGG, 3 as di-RGG (FMRP, FXR1, and Sam68), 2 as tri-RG (DDX4 and Caprin-1), and 1 as di-RG (G3BP1) by Thandapani et al. [8], as shown by the color codes in Table 3. The G/R ratios in different motifs range from 1.0 to 4.5. The non-GR% range from 0 to more than 50%. Proline (P) is the most abundant non-GR amino acid after summing up all GAR motifs in these hRG/RGG proteins, and D, S, N, and Y are also widely distributed (Figure 2B). F is not the most abundant non-GR amino acid in the GAR motifs of these proteins.
Of the GAR motifs listed in Table 3, few are longer than 50 amino acids with more than 10 RG plus RGG repeats, as the long GAR domains in FNG are. Specifically, the longest GAR motifs in EWS and TAF15, the fourth motif in each protein, contain 2 RG/10 RGG (58 residues) and 11 RGG repeats (82 residues), respectively. The longest GAR motif in CHTOP has 53 residues with 10 RG/5 RGG, and the longest in Lsm14a has 45 amino acids with 4 RG/8 RGG. Among them, the long GAR domains of EWS and Lsm14a are close to the ones in FNG, with multiple RGG but few RG, several Fs in the motifs, and a non-GR% of about 24%.
The FET family proteins are structurally related but functionally different RNA-binding proteins [26]. They all contain SYGQ-rich sequences at the N-terminus and separated GAR motifs of various lengths after this region. GMF identified five GAR motifs in EWS and TAF15 and four in FUS, distributed from the middle to close to the C-terminus of the polypeptides. In EWS, there are repetitive PGG elements between GAR motifs 2 and 3 and also between motifs 4 and 5. The longest GAR motif, motif 4 in TAF15, is not disrupted by such elements but contains 10 GGYGGD repeats between the RGGs, making this motif much longer than the longest motifs in FUS and EWS. GAR motif 4, the longest one in FUS, has 34 amino acids consisting of eight RGGs and one RG and is shorter than those in EWS and TAF15. The overall RG/RGG repeat numbers of hFET are 21/59. The RG/RGG numbers of the longest motifs in these proteins are 3/29, indicating that the longest motifs have strong RGG preferences. Though the high RGG preference is similar to that of hFNG (9 RG and 37 RGG in total), the distribution patterns of non-GR amino acids in the GAR motifs in each group are distinctive. While the FNG proteins all contain multiple Fs, F is not the major non-GR amino acid in any single motif of the FET proteins. The longest motif in TAF15 has 10 D(S)RGG(G)YGG repeats with 11 Ds and 10 Ys, making D the most frequent non-GR amino acid in hFET, followed by Y (Figure 2C). D is also found in some short GAR motifs in the FET proteins, but Y is restricted to TAF15. On the other hand, P and M are frequently encountered in EWS. Therefore, the longest GAR domains in FET proteins are different from those of the three nucleolar proteins. They are even variable within the FET group, though they all are long and contain repetitive RGGs.

Analyses of Extra-Long GAR Motifs in Human and Other Proteomes
As the GMF program was coded for the identification and characterization of extra-long GAR motifs in proteins, we then used GMF to analyze the human proteome (GRCh38.p13 from the Genome Reference Consortium) for all extra-long GAR domains. Many proteins in the human proteome are isoforms produced by alternative splicing, and some GMF hits have dozens of isoforms. We sorted and inspected the GMF results in the .csv file and identified 21 motifs with at least 10 RG plus RGG repeats in 161 protein isoforms encoded by 18 genes. A summary of the GMF analyses of the 21 long GAR motifs is shown in Table 4, with the numbers of isoforms containing the motifs indicated (.csv GMF output of the isoforms and all motifs in the proteins is in Supplementary Table S3). We found that in the human proteome, the number of RG plus RGG repeats in one single GAR motif is 16 (11 RG plus 5 RGG repeats) at most. Two paralogous proteins, heterogeneous nuclear ribonucleoprotein Q (hnRNPQ) and hnRNPR, contain such motifs. Interestingly, different isoforms of hnRNPQ vary at the C-terminus of the longest GAR domain, resulting in RG/RGG repeat numbers of 11/5, 10/3, and 9/4.
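The sorting and inspection step described above can be approximated by filtering the tabular output on the combined RG plus RGG count and grouping the surviving isoforms by gene. The sketch below is only an illustration under stated assumptions: the row keys (`gene`, `isoform`, `motif`, `RG`, `RGG`) are hypothetical and need not match the columns of the GMF .csv file, the default threshold of 10 follows the "at least 10" wording of the abstract, and the rows are invented toy records.

```python
from collections import defaultdict


def long_motifs_by_gene(rows, min_repeats=10):
    """Group isoforms and motifs per gene for motifs with enough RG + RGG repeats.

    rows: iterable of dicts with the hypothetical keys
          "gene", "isoform", "motif", "RG", and "RGG".
    """
    genes = defaultdict(lambda: {"isoforms": set(), "motifs": set()})
    for row in rows:
        if int(row["RG"]) + int(row["RGG"]) >= min_repeats:
            genes[row["gene"]]["isoforms"].add(row["isoform"])
            genes[row["gene"]]["motifs"].add(row["motif"])
    return dict(genes)


# Toy records (not real proteome data); values are strings as read from a CSV.
toy_rows = [
    {"gene": "GENE_A", "isoform": "A-1", "motif": "motif_1", "RG": "6", "RGG": "5"},
    {"gene": "GENE_A", "isoform": "A-2", "motif": "motif_1", "RG": "6", "RGG": "5"},
    {"gene": "GENE_B", "isoform": "B-1", "motif": "motif_2", "RG": "2", "RGG": "1"},
]
for gene, info in long_motifs_by_gene(toy_rows).items():
    print(gene, len(info["isoforms"]), "isoforms,", len(info["motifs"]), "motifs")
```

Rows read with `csv.DictReader` could be passed in directly once the key names are adapted to the actual column headers.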
Besides hnRNP R and Q, there are 10 other proteins containing more RG than RGG repeats. Zinc finger CCCH domain-containing protein 4 (ZC3H4) is the only protein with two motifs that have RG/RGG repeats of more than 10. The first motif contains 13 RGs and 1 RGG, and the second one has 9 RGs and 2 RGGs. Methyl-CpG-binding domain protein 2 (MBD2) has a GAR motif with 12 RGs and 2 RGGs. Though the percentages of G, R, and non-GR amino acids of the three motifs are similar, the identity of the non-GR amino acids varies greatly between proteins. Myosin XVB contains 5 GAR motifs, and the longest one has 11 RG repeats. Some target proteins, such as myosin XVB and Bromodomain and WD-repeat-containing protein 3 (BRWD3) (9 RGs and 4 RGGs), are very long, with the longest GAR domain accounting for only about 2% of the protein. The long motifs in RNA-binding protein 26 (RBM26) and histone-lysine N-methyltransferase EHMT2 both account for about 3% of the protein, containing 10 RGs and 9 RGs plus 1 RGG, respectively. On the contrary, the GAR domain with 10 RG and 5 RGG repeats of the chromatin target of PRMT1 protein (CHTOP, also shown in Table 3) accounts for more than one-fifth of the protein.
The GAR motif may simply contain only RG repeats without any interrupting non-RG amino acids such as the one with 9 RGs and 1 RGG in zinc finger protein 579 (ZNF579). On the contrary, it may contain more complicated tandem repeats such as the GAR motif in initiation factor 3 subunit A (eIF-3A) with 5 PRRGL/(M)DDDRG repeats (11 RG repeats separated by multiple Ds). The only ribosomal protein on the list is 40S ribosomal protein S2 (RPS2) with 8 RGs and 3 RGGs.
The six proteins containing the long GAR domains with more RGG than RG repeats are the three FNG proteins, EWS, TAF15, and Lsm14a, all of which we described in previous sections. Therefore, long GAR domains are rare in the human proteome, especially the ones with more RGG than RG repeats.

Discussion
The rRNA 2′-O-methyltransferase FBL is a major nucleolar protein for rRNA processing. We analyzed the N-terminal GAR domains in FBL from evolutionarily distant model organisms and showed their features (Tables 1 and 2). The long GAR domains contain more than 10 RGGs but few RG repeats, connected by multiple Gs or some other amino acids, mostly F, and extend for more than 50 amino acids. Based on the GAR domains in different FBL orthologues, we defined the GAR motif with G(0,3)-X(0,5)-G(0,3)-X(0,1) between R-G(1,2) elements to allow multiple Gs surrounding the arginine residues and a few other amino acids between the RG/RGG sequences. We developed the GMF algorithm to facilitate the characterization and comparison of GAR domains containing repetitive RG or RGG elements in different proteins. Our criterion for the GAR motif is more relaxed and flexible than the tri-RGG [RGG(X₀₋₄)RGG(X₀₋₄)RGG], di-RGG [RGG(X₀₋₄)RGG], tri-RG [RG(X₀₋₄)RG(X₀₋₄)RG], or di-RG [RG(X₀₋₄)RG] motifs defined previously [8]. This pattern is optimized to accommodate some long G arrays or a short segment of non-GR amino acids in the GAR domain of FBL. Though the main goal of the GMF program is to find and characterize long GAR motifs with repetitive RG or RGG repeats, segments that partially match the pattern and are as short as four amino acid residues (RGRG), as well as long motifs extending beyond the pattern, can all be identified.
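The difference between the four previously defined classes and the more relaxed GAR criterion can be illustrated with regular expressions. The sketch below encodes the four class definitions as quoted in the text from reference [8], with X(0-4) written as `.{0,4}`. Checking the classes from most to least stringent is an assumption made here for illustration, as are the names `CLASSES`, `classify`, and `GAR_PATTERN`; the GAR regex is an assumed reading of the GMF pattern, not the program's code.

```python
import re

# Class definitions as quoted in the text; the checking order is an assumption.
CLASSES = [
    ("tri-RGG", re.compile(r"RGG.{0,4}RGG.{0,4}RGG")),
    ("di-RGG", re.compile(r"RGG.{0,4}RGG")),
    ("tri-RG", re.compile(r"RG.{0,4}RG.{0,4}RG")),
    ("di-RG", re.compile(r"RG.{0,4}RG")),
]

# Assumed regex reading of the GMF pattern ("." stands for X).
GAR_PATTERN = re.compile(r"G{0,3}.?RG{1,2}(?:G{0,3}.{0,5}G{0,3}.?RG{1,2})+")


def classify(sequence):
    """Return the first class whose pattern occurs in the sequence, else None."""
    for name, pattern in CLASSES:
        if pattern.search(sequence):
            return name
    return None


# Toy sequences (not real proteins).
for seq in ["AARGGFRGGSRGGAA", "AARGPRGPRGAA", "AARGRGAA", "RGGAAAAAARGG"]:
    print(seq, classify(seq), bool(GAR_PATTERN.search(seq)))
```

In the last toy sequence the two RGG elements are separated by six other residues, so none of the four class patterns applies, while the relaxed linker of the GAR pattern (up to six non-G residues through X(0,5) plus X(0,1)) still spans the gap.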
GMF analyses of FBL as well as two nucleolar proteins NCL and GAR1 in five vertebrate species showed the potential of the program to help inspect the domains through evolution. In Table 2, zebrafish is the only species with the long GAR domain of FBL split in the middle, even under the relaxed GAR definition of GMF. Zebrafish NCL also has an extra short GAR motif not identified in others. Analyses of these proteins in other bony fish and vertebrate species might reveal if the differences are common in other fish. We also noticed a reverse length trend for the two GAR domains in GAR1. While the GAR-N domains become longer and more diverse, the GAR-C domain becomes shorter from fish to amniotic species, thus maintaining a balanced total length.
Besides the FNG proteins, we analyzed other typical human RG/RGG-repeat-containing proteins listed in the review by Chong et al. [25]. GMF analyses showed that GAR motifs of these proteins basically are different from the GAR domains of the nucleolar proteins. For example, the long GAR domains of the three nucleolar proteins are either at the N-terminus (FBL), the C-terminus (NCL), or both (GAR1), but other RG/RGG-repeat-containing proteins might contain one or a few GAR motifs with different lengths at various positions of the polypeptides. The features of the extra-long GAR domains of the three nucleolar proteins include numerous RGGs but few RGs, and multiple Fs but fewer other amino acids. However, the percentage of F in the GAR motifs of other RG/RGG-repeat-containing proteins is low, while P, D, and S are more frequently encountered (Table 3 and Figure 2B,C). Specifically, we compared the long GAR domains in FNG with those in FET proteins. Though they all are long with repetitive RGGs, the non-GR amino acids are different. For each FET protein, the non-GR amino acids in the long and other short GAR motifs are also different. In addition, the GAR domains of the three nucleolar proteins are either at the N-terminus (FBL) or the C-terminus (NCL), or both (GAR1), but in FET, there are 3-4 short GAR motifs distributed in the proteins and 1 long GAR domain at the C-terminal part.
The identification of the long GAR domains in NCL and FBL was connected with arginine methylation in these domains [10,11]. GAR1 is also modified by arginine methylation [12,13]. Experimental manipulation of FBL showed that no matter whether the GAR domains are randomized, truncated, or extended, they can self-associate for pre-rRNA sorting and processing, with the natural GAR length being optimal [7]. Swapping either of the two GAR domains from GAR1 into FBL can complement the function of FBL for pre-rRNA sorting and processing [7]. Though the long RG/RGG core in FBL, NCL, and GAR1 is conserved across evolutionarily distant species, the exact sequences and lengths are not conserved, indicating that they are not critical. Maintenance of the GAR domains in these proteins thus is likely to be related to the essential modification. Methylation of the guanidino nitrogens of arginine residues preserves the positive charge yet increases steric hindrance, and the loss of hydrogen bonds might lead to modified interactomes and phase separation [25]. Phenylalanine is frequently distributed in the GAR domains of FNG in the GFRG or GFG context. Due to its hydrophobicity and π electrons, phenylalanine has been used as a methylarginine mimic in some studies [27–31]. It is possible that in the extra-long GAR domains of the nucleolar proteins, F might function like a constitutive methylarginine in this context. Dispersed F in the GAR domains may allow a flexible range of arginine methylation levels to meet the basal requirement in these proteins. Consistent with this hypothesis, F in the GAR domain of NCL has been reported to be important for G-quadruplex binding and folding [32].
The RG/RGG-repeat-containing proteins analyzed in Table 3 are also methylated at the arginine residues in the GAR motifs [25], but P, D, and S become the major non-GR amino acids. Amino acid residues interspersed in the GAR domain can affect arginine methylation. Proteomic analyses showed that P is second to G at the +1 or −1 positions of methylarginines [33], consistent with the high frequency of P in the GAR motifs. Negatively charged amino acids such as aspartate block methylation of neighboring arginine residues [18]. Though both FUS and EWS are heavily arginine methylated, negatively charged aspartic acids neighboring R residues interfere with PRMT binding and reduce methylation in the GAR domain of another FET family member, TAF15 [26]. Similarly, phosphorylated serines mimic aspartates and can reduce neighboring arginine methylation [18]. Spontaneous deamidation of asparagine or glutamine residues can result in negatively charged residues and might also modulate arginine methylation in the domain. It is thus critical to analyze the presence of different intervening amino acids that might adjust the arginine methylation of the GAR domain.
Most of the GAR-motif-containing proteins listed in this study are components of different MLOs or RNPs. Different GAR motifs, together with other regions in the proteins, determine their subcellular distribution. Besides ionic interactions, the guanidino groups can also form π stacking and cation-π stacking. Glycine is the smallest and most flexible amino acid and tolerates a wide range of backbone arrangements. The unordered, extended, and flexible RGG/RG-rich motifs can provide multivalency for intra- or intermolecular interactions for phase separation [25]. Arginine methylation of the GAR motifs can affect the LLPS of the proteins for MLO assembly [19]. One of the best-studied examples is arginine methylation in the long GAR motif of FUS, which can reduce the cation-π interaction of the motif with the N-terminal low-complexity domain, preventing phase separation and further amyloid formation [34]. The arrangement of the RGG and RG repeats, as well as of other amino acids with different chemical properties in the GAR motifs, should be critical for LLPS to determine the distribution of these proteins and may affect the pathogenesis of neurological diseases, cancers, and viral infections.
We also explored the whole human proteome using GMF for additional proteins containing long GAR motifs. The six long GAR domains in FBL, NCL, GAR1 (GAR-C), EWS, TAF15, and Lsm14a with more RGG than RG repeats have all been analyzed in Tables 2 and 3, indicating that the pool of this type of long GAR domain in the human proteome is small. FBL, NCL, and GAR1 are nucleolar proteins. Lsm14a is a component of cytoplasmic processing bodies. EWS and TAF15 (as well as FUS) can localize to cytoplasmic stress granules and paraspeckles and pathologically accumulate as cytosolic inclusions in patients with amyotrophic lateral sclerosis (ALS) and frontotemporal lobar degeneration (FTLD) [35]. The long GAR domains, together with other parts of these proteins, can lead to the LLPS of these proteins in specific MLOs. The paralogous proteins hnRNP R and hnRNP Q both contain GAR domains with the highest RG/RGG repeat numbers identified by GMF. The relaxed GMF motif definition allows these domains to continue because some RG or RGG repeats are separated by 5-6 non-GR amino acids. Arginine methylation of RGG box-containing hnRNP proteins, including hnRNP R and Q, accounts for 65% of nuclear ADMA levels [15]. Methylation of multiple arginine residues by PRMT1 in the long GAR motif of hnRNP Q has been reported [36]. Both hnRNP Q and R also accumulate in pathological inclusions in FTLD with FUS [37].
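The effect of the relaxed definition on domain continuity can be sketched in Python by translating the G(0,3)-X(0,1)-R-G(1,2)-X(0,5)-G(0,2)-X(0,1)-R-G(1,2) pattern into a regular expression. This is an illustrative sketch, not the GMF source code: it assumes that X stands for any residue and that overlapping matches are merged into one span, and the names `gar_pattern`, `find_gar_spans`, and the `max_gap` parameter are hypothetical.

```python
import re

def gar_pattern(max_gap: int = 5) -> re.Pattern:
    """Regex for G(0,3)-X(0,1)-R-G(1,2)-X(0,max_gap)-G(0,2)-X(0,1)-R-G(1,2).

    Assumption: X is any residue ('.'). The lookahead wrapper lets finditer
    report overlapping matches, one per possible start position.
    """
    core = rf"G{{0,3}}.?RG{{1,2}}.{{0,{max_gap}}}G{{0,2}}.?RG{{1,2}}"
    return re.compile(rf"(?=({core}))")

def find_gar_spans(seq: str, max_gap: int = 5) -> list:
    """Return merged (start, end) spans (0-based, end exclusive).

    Note: because of the leading optional X, a span may include one
    flanking non-GR residue before the first repeat.
    """
    spans = []
    for m in gar_pattern(max_gap).finditer(seq):
        start, end = m.start(1), m.end(1)
        if spans and start <= spans[-1][1]:
            # Overlapping or touching match: extend the current span.
            spans[-1] = (spans[-1][0], max(spans[-1][1], end))
        else:
            spans.append((start, end))
    return spans

# Toy sequences: two RGG repeats separated by six or seven alanines.
print(find_gar_spans("RGG" + "A" * 6 + "RGG"))             # one continuous span
print(find_gar_spans("RGG" + "A" * 7 + "RGG"))             # gap too long
print(find_gar_spans("RGG" + "A" * 7 + "RGG", max_gap=6))  # wider gap allowed
```

In this sketch the X(0,5) and X(0,1) terms together tolerate up to six intervening non-G residues between two repeats, which matches the 5-6 residue separations described above; a seventh residue breaks the span unless the gap is widened.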
Among the 10 other proteins containing more RG than RGG repeats, 3 are nuclear proteins containing zinc fingers. ZC3H4, containing two long GAR motifs, is an RNA-binding protein localized at chromatin to suppress transcription of non-coding RNAs [38]. Both motifs precede the three consecutive C3H1 zinc finger motifs in the N-terminal half of the protein. RBM26 is an RNA-binding protein with one C3H1 zinc finger and two RNA-binding motifs (RRMs). ZNF579 has eight C2H2 zinc fingers, and its GAR motif is distinctive in that the nine RG repeats and the final RGG repeat are not interrupted by any non-RG amino acids. PRMT5 has a "GRG" substrate preference and the GAR motif in ZNF579 is modified by symmetric di-methylation [39].
A few other proteins show chromatin association and/or are related to epigenetic modification. CHTOP, whose full name is "chromatin target of PRMT1 protein" [40], can bind to PRMT1 and promote methylation of arginine 3 of histone H4 (H4R3). Its binding to 5-hydroxymethylcytosine (5hmC) can help to recruit the CHTOP-methylosome complex to specific sites on the chromosome for selective gene activation [41]. Methyl-CpG-binding domain protein 2 (MBD2) is a component of the MeCP1 complex that contains HDAC1 and HDAC2 [42]. BRWD3 is a nuclear protein with eight WD repeats in the N-terminal half and two Bromo domains in the C-terminal part, and the GAR motif is near the C-terminal end. Histone-lysine N-methyltransferase EHMT2, with the GAR motif right at the N-terminus, is a SET-domain H3K9 methyltransferase [43]. These proteins might thus modulate posttranslational modification (CHTOP and MBD2), act as writers that deposit the modification (EHMT2), or act as readers of modified bases in DNA (CHTOP and MBD2) or of PTMs (BRWD3). LLPS can also explain sub-chromatin structure formation and transcriptional control. For example, MBD2 can induce clustering of pericentric heterochromatin and is critical for chromocenter structure [44]. These GAR motifs with a strong RG preference might play further roles in chromatin sub-compartmentation.
Two cytosolic RG-rich proteins are involved in translation. The long GAR motif in initiation factor 3 subunit A (eIF-3A) contains 5 tandem PRRGL/(M)DDDRG repeats within a region with 25 approximate tandem repeats. This specific long GAR motif, with RG repeats separated by multiple negatively charged D residues, is also a mixed charge domain (MCD). MCDs with multiple RD repeats promote nuclear speckle condensation [45]. Whether and how this GAR/MCD participates in the regulation of the eIF3 complex is an interesting question. The N-terminal long GAR motif of RPS2 is methylated by PRMT3 [46,47].
In summary, GMF can show the pattern, the position, the numbers of RG/RGG repeats, the non-GR amino acids in the motifs, the coverage, the G/R ratios, and the percentages of G, R, and other amino acids in the motifs, and can thus provide critical information for further evaluation of the motifs in LLPS. The GMF program can be a starting tool to facilitate the analysis of GAR motifs in proteins through evolution, as well as the design of putative therapeutic targets focusing on the motifs. Further modification of the GMF program, for example, to include the analysis of FGG/FG repeats and other elements, can improve and expand its application.
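The per-motif descriptors listed above can be approximated with a short Python sketch. It does not reproduce GMF's own implementation: the function name `motif_stats`, the rule that an RG followed by another G is counted as RGG rather than RG, and the definition of coverage as motif length divided by protein length are all assumptions made for illustration.

```python
import re
from collections import Counter

def motif_stats(motif: str, protein_length: int) -> dict:
    """Summarize one GAR motif (counting rules are illustrative assumptions)."""
    rgg = len(re.findall(r"RGG", motif))     # non-overlapping RGG repeats
    rg = len(re.findall(r"RG(?!G)", motif))  # RG not followed by another G
    g = motif.count("G")
    r = motif.count("R")
    other = [aa for aa in motif if aa not in "GR"]
    n = len(motif)
    return {
        "RGG": rgg,
        "RG": rg,
        "G_per_R": g / r if r else None,     # G/R ratio
        "pct_G": 100 * g / n,
        "pct_R": 100 * r / n,
        "pct_other": 100 * len(other) / n,
        "non_GR": Counter(other),            # intervening residues
        # Assumption: coverage = motif length as a percentage of protein length.
        "coverage_pct": 100 * n / protein_length,
    }

# Hypothetical nine-residue motif in a hypothetical 90-residue protein.
stats = motif_stats("RGGFRGRGG", protein_length=90)
for key, value in stats.items():
    print(key, value)
```

A table of such dictionaries, one per motif, could be written out with Python's `csv` module for comparison across proteins.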