Searching for G-Quadruplex-Binding Proteins in Plants: New Insight into Possible G-Quadruplex Regulation

G-quadruplexes are four-stranded nucleic acid structures occurring in the genomes of all living organisms and viruses. It is increasingly evident that these structures play important molecular roles, generally by modulating gene expression and overall genome integrity. For a long time, G-quadruplexes have been studied mainly in the context of human promoters, telomeres, and associated diseases (cancers, neurological disorders). Several G-quadruplex-binding proteins are known, providing promising targets for influencing G-quadruplex-related processes in organisms. Nonetheless, in plants, only a small number of G-quadruplex-binding proteins have been described to date. Thus, we aimed to bioinformatically inspect the available protein sequences to find the best protein candidates with the potential to bind G-quadruplexes. Two similar glycine- and arginine-rich G-quadruplex-binding motifs have been described in humans. The first is the so-called “RGG motif” (RRGDGRRRGGGGRGQGGRGRGGGFKG), and the second, which has been described recently, is known as the “NIQI motif” (RGRGRGRGGGSGGSGGRGRG). Using this knowledge, we searched for plant proteins containing the above-mentioned motifs using two independent approaches (BLASTp and FIMO scanning) and revealed many proteins containing the G4-binding motif(s). Our research also revealed the core proteins involved in G4 folding and resolving in green plants, algae, and the key plant model organism, Arabidopsis thaliana. The discovered protein candidates were annotated using STRINGdb and sorted by their molecular and physiological roles in simple schemes. Our results point to a significant role of G4-binding proteins in the regulation of gene expression in plants.


Introduction
G-quadruplexes (G4s) are secondary structures of nucleic acids that can arise in guanine-rich DNA or RNA regions [1]. Each G4 is formed by basic units termed guanine tetrads (see Figure 1). A single guanine tetrad consists of four guanine nucleotides interconnected by Hoogsteen base pairing. G4s are further stabilized by positively charged monovalent ions localized in their central cavity [2]. Although G4s share common properties, there is also great structural diversity in their composition, including the number of planes, loop lengths, the orientation of tracts, etc. [3-5]. In humans, G4-forming sequences have been found in genes that are important for key cellular processes [1], and it appears that they play an important role in the development of cancer [7] and neurodegenerative diseases [8,9]. Additionally, G4s may play a role in viral lifecycles [10], including that of the novel SARS-CoV-2 [11]. It is clear that the formation of G4 structures alone is not sufficient to support all of their functions, and the binding of specific proteins (including DNA/RNA-binding proteins) often works as a trigger and/or modulator of their effects. Currently, nearly 100 G4-binding proteins are known to be present in humans and other model organisms [12,13].
Although the number of research papers dealing with G4s in humans is exponentially rising, there is still only limited evidence regarding G4s in plants. Recently, genome-wide studies were performed on barley [14] and wheat [15], and the partial knowledge about other agriculturally important plants was reviewed [16]. The G4s were associated with a large range of important molecular processes, such as the regulation of transcription, translation, the response to various types of stresses, and even plant-specific processes (such as flowering and phloem formation) [16]. G4s were also discussed as a possible UV sensor in plants [17], and, lastly, previous research suggested that G4s might be responsible for the formation of the SHORT ROOT RNA phase-separation-like phenomenon in vivo [18].
Even less is known about G4-binding proteins in plants. In 2015, Andorf et al. described nucleoside diphosphate kinase 1 (ZmNDPK1) as the first known G4-binding protein in plants [19]. Cho et al. described the binding of the protein JULGI to RNA G4s located in the 5′ UTRs of the SMXL4 and SMXL5 genes, which inhibits their translation [20]. Additionally, Sjakste et al. analyzed the sequences of DNAs that were tightly bound to proteins in barley seedlings [21]. They found that the sequences bound by these proteins were highly enriched in GC content compared to the rest of the barley genome, and CD spectroscopy confirmed that JULGI-bound sequences are able to form G4s [20].
Functional and structural studies of G4-binding proteins in plants remain scarce, and the topic deserves closer study. Therefore, we aimed to computationally identify potential G4-binding proteins in plant genomes, using the experimentally validated RGG region of the protein FMRP (which is known to interact with G4s [22]) and the so-called Novel Interesting Quadruplex Interaction (NIQI) motif, which is common in many G4-binding proteins in humans [23].
The main aim of this study was to inspect whether plants (Viridiplantae, both green plants and algae) possess proteins containing RGG and NIQI motifs with the ability to bind G4s and, thus, to make a preselection of suitable candidates for further wet-lab testing. For these purposes, we used two distinct bioinformatic tools: BLASTp and FIMO. Lastly, we identified which molecular and physiological processes our protein candidates are involved in.

Identification of G4-Binding Proteins Using the FIMO Approach
Identification of G4-binding proteins in the model plant organism Arabidopsis thaliana was performed using FIMO (https://meme-suite.org/meme/tools/fimo, accessed on 26 January 2021) [24], which is part of the MEME Suite [25], with a custom threshold of p = 1 × 10⁻⁹ to minimize false-positive results (the default value is 1 × 10⁻⁴). As the input for motif scanning, we used the reference Arabidopsis thaliana proteome (40,885 proteins from the RefSeq NCBI database [26]). Motif scanning was carried out for both previously identified G4-binding motifs (NIQI, "RGRGRGRGGGSGGSGGRGRG", and the experimentally verified RGG motif from the FMRP protein, "RRGDGRRRGGGGRGQGGRGRGGGFKG"). Subsequently, we repeated the whole analysis for all members of the Viridiplantae group, comprising both green plants and algae.
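The thresholding step described above can be illustrated with a minimal Python sketch. It assumes that the FIMO results were downloaded as a tab-separated file with the column names used by recent MEME Suite releases (motif_id, sequence_name, p-value, and so on); the function name and the example rows are hypothetical and are not data from this study.

```python
import csv

# Invented example in the assumed FIMO TSV layout (not real results).
FIMO_EXAMPLE = "\n".join([
    "motif_id\tmotif_alt_id\tsequence_name\tstart\tstop\tstrand\tscore\tp-value\tq-value\tmatched_sequence",
    "NIQI\t\tprotA\t10\t29\t.\t95.1\t2.0e-12\t1.0e-8\tRGRGRGRGGGSGGSGGRGRG",
    "NIQI\t\tprotA\t40\t59\t.\t80.3\t5.0e-10\t1.0e-6\tRGRGGGRGGGSGGRGGRGRG",
    "NIQI\t\tprotB\t5\t24\t.\t31.0\t3.0e-6\t0.02\tRGSGSGRGAGSGGSGARGRG",
    "RGG\t\tprotC\t1\t26\t.\t88.7\t1e-9\t1.0e-5\tRRGDGRRRGGGGRGQGGRGRGGGFKG",
    "# trailing comment lines, if present, are ignored",
])


def proteins_with_significant_hits(fimo_tsv_text, p_threshold=1e-9):
    """Map each motif to the set of proteins with a hit at or below p_threshold."""
    # Drop blank lines and comment lines before handing the rows to csv.
    lines = [ln for ln in fimo_tsv_text.splitlines()
             if ln.strip() and not ln.startswith("#")]
    hits = {}
    for row in csv.DictReader(lines, delimiter="\t"):
        if float(row["p-value"]) <= p_threshold:
            # A protein is counted once, however many motif occurrences it has.
            hits.setdefault(row["motif_id"], set()).add(row["sequence_name"])
    return hits


hits = proteins_with_significant_hits(FIMO_EXAMPLE)
for motif, proteins in sorted(hits.items()):
    print(motif, len(proteins), sorted(proteins))
```

Counting proteins rather than individual motif occurrences is a design choice of this sketch; whether the per-motif protein counts reported in this study were derived in the same way is not stated.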

Identification of G4-Binding Proteins Using BLASTp Approach
As an alternative method for comparison with the FIMO results, we performed BLASTp (https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE=Proteins, accessed on 30 July 2021) [27] searches for both G4-binding sequences separately (NIQI and RGG). First, the search was limited to the organism Arabidopsis thaliana (non-redundant protein sequences database (nr), default parameters), and then the search was extended to the entire Viridiplantae group, comprising both green plants and green algae (nr database).
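Hits from a search of this kind can be ranked by E-value with a few lines of Python. The sketch below assumes the hits were exported in BLAST's standard 12-column tabular format (query, subject, percent identity, alignment length, mismatches, gap openings, query start/end, subject start/end, E-value, bit score); the function name and the example rows are invented for illustration.

```python
# Invented example rows in the assumed 12-column tabular layout.
BLAST_EXAMPLE = "\n".join([
    "NIQI\tprot1\t85.0\t20\t3\t0\t1\t20\t101\t120\t2e-10\t45.1",
    "NIQI\tprot2\t60.0\t15\t6\t0\t3\t17\t50\t64\t1e-3\t22.0",
    "NIQI\tprot3\t80.0\t20\t4\t0\t1\t20\t10\t29\t5e-8\t38.0",
    "NIQI\tprot1\t70.0\t20\t6\t0\t1\t20\t300\t319\t4e-6\t30.2",
])


def top_hits(blast_tabular_text, n=10):
    """Return the n subjects with the lowest E-value as (subject, evalue) pairs."""
    best = {}
    for line in blast_tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and optional comment lines
        fields = line.split("\t")
        subject = fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        # Keep only the best-scoring alignment for each subject protein.
        if subject not in best or evalue < best[subject][0]:
            best[subject] = (evalue, bitscore)
    # Lower E-value first; ties are broken by the higher bit score.
    ranked = sorted(best.items(), key=lambda kv: (kv[1][0], -kv[1][1]))
    return [(subject, ev) for subject, (ev, _) in ranked[:n]]


for subject, evalue in top_hits(BLAST_EXAMPLE, n=10):
    print(subject, evalue)
```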

Functional Interaction Network Analysis Using STRING Approach
Functional interaction networks of proteins containing the RGG [28] or NIQI [23] motif were constructed using the STRING (https://string-db.org/, accessed on 20 February 2021) [29] online tool with the following parameters: disable structure previews in bubbles, hide disconnected nodes in network, show input protein names, and k-means clustering with 3 clusters. All other parameters (including the interaction score) were left at their default values.
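The networks in this study were built in the STRING web interface. For readers who prefer a scripted workflow, the following sketch shows how a comparable network request could be composed for STRING's public REST interface. It only builds the request URL and does not contact the server. The helper name and the caller_identity label are hypothetical, and the display and clustering options listed above are settings of the web interface that this sketch does not reproduce.

```python
from urllib.parse import urlencode

STRING_API = "https://string-db.org/api"


def string_network_url(identifiers, species=3702, output_format="tsv"):
    """Build a STRING 'network' request URL for a list of protein names.

    species=3702 is the NCBI taxonomy identifier of Arabidopsis thaliana.
    """
    params = {
        # STRING expects multiple identifiers separated by carriage returns.
        "identifiers": "\r".join(identifiers),
        "species": species,
        # Hypothetical label identifying the calling script.
        "caller_identity": "g4_motif_sketch",
    }
    return f"{STRING_API}/{output_format}/network?{urlencode(params)}"


print(string_network_url(["AGO2", "AGO3", "KAKU4"]))
```

No interaction score is passed here, so the server-side default would apply, in line with the default settings described above.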

Identification of G4-Binding Proteins in Plants Using a FIMO Approach
In general, based on the presence of a G4-binding motif, we identified more than 400 proteins with a theoretical potential to bind G4 structures in Arabidopsis thaliana (555 containing a significant RGG motif, and 408 containing a significant NIQI motif). The complete FIMO results can be found in the Supplementary Materials (Files S1 and S2). The functional protein association network (STRING) analysis revealed a high number of functionally interconnected proteins with specific molecular functions, such as DNA topology modifications, DNA/RNA binding, RNA metabolism, and ribosome biogenesis. The results of the STRING analyses are depicted in Figure 2A. The STRING functional enrichment analysis following the NIQI motif scanning revealed 93 interconnected proteins with the potential to interact with G4s. These proteins can be divided into three groups: proteins able to bind or interact with DNA or RNA; proteins linked with ribosome biogenesis and maintenance; and other proteins. The most interesting proteins from the first group (proteins able to interact with DNA or RNA) were DNA topoisomerase 3α (involved in homologous recombination and the relaxation of torsional stress [35]) and the plant-unique protein KAKU4 (responsible for nuclear morphology in plants [36]). In addition, our data also indicate G4-interacting potential for the following: TF II D-subunit 15b (transcription initiation factor), an RNA-binding family protein (RRM/RBD/RNP motifs) responsible for RNA splicing, the serine/arginine-rich splicing factor RSZ21 (mRNA processing [37]), the mediator of RNA pol II-transcription subunit 36a (mediator of RNA polymerase II), protein decapping 3α (translation repressor), and proteins belonging to the Argonaute family (AGO2 and AGO3, which contain the NIQI motif).
In the second group of proteins (proteins linked with ribosome biogenesis and maintenance), the most interesting were the periodic tryptophan protein (processing of pre-18S ribosomal RNA) and several ribosomal proteins: 60S acidic family protein, 60S ribosomal protein L17-1, 60S ribosomal protein L19-1, 40S ribosomal protein S12-1, 40S ribosomal protein S2-1, and 40S ribosomal protein S2-3.
The STRING functional enrichment analysis following the RGG motif scanning revealed 96 proteins, which can be divided into the same three groups as described above. Some of them differed from the results of the NIQI motif search, while others were the same or similar, and they were involved in processes similar to those described in the paragraph above. Among the promising G4-binding proteins identified were GRP2 and GRP7. GRP2 is a glycine-rich protein that binds nucleic acids and promotes nucleic acid melting, and also binds to and unwinds RNA, ssDNA, and dsDNA [38]. GRP7 binds to RNAs and DNAs (preferentially to poly U or poly G) and is involved in alternative splicing [39]. Further interesting putative G4-binding proteins were: RSZp22 (RNA nucleocytoplasmic shuttling protein); RNA 2′-phosphotransferase (tRNA splicing); the Argonaute family proteins AGO1, AGO2, and AGO3 (small RNA binding [40]); NUC-L1 (rRNA processing, nucleolus organization [41]); RSZ22a (intron recognition and spliceosome assembly); and EIF4B1 (translation initiation).
To better understand the mechanistic basis of protein-G4 interactions, we carried out a molecular docking analysis of two representative G4-binding proteins identified above. First, we selected the RNA-binding protein HEN2, which is an important ATP-dependent RNA helicase involved in the degradation of a large number of small non-coding RNAs (e.g., snoRNA and miRNA precursors), incompletely spliced mRNAs, and transcripts produced from pseudogenes and/or intergenic regions [42]. In plants, HEN2 regulates floral organ spacing and identity [43]. The docking of a parallel conformation of G4 RNA (GGGUCGGGUUGGGCGGG) to this protein showed that it fits well in an arginine-rich cavity (Figure 2C), and that the neighboring glycine residues may promote bending to allow the arginine residues to recognize the G4. Second, we selected the protein KAKU4 and docked it to G4 DNA (GGGGCCGGGGCCGGGGCCGGGG). We found that, here, arginine residues also play an important role in the recognition of the G4, particularly Arg 222, Arg 238, Arg 279, and Arg 282 (see Figure 2D).

Identification of G4-Binding Proteins in Plants Using a BLASTp Approach
To validate the FIMO search results described above, we used another computational algorithm, BLASTp. The 10 most statistically significant results (according to the E-value) for Arabidopsis thaliana are shown in Table 1, separately for the NIQI and RGG motifs. To predict G4-binding proteins in non-model plant species, we performed an additional BLASTp search that was not limited to Arabidopsis thaliana but applied to the whole Viridiplantae group (green plants); the results are shown in Table 2. Hits with a known function or associated process included, for example, THO complex subunit 4a, which is responsible for RNA binding (the THO complex balances transcription and mRNA processing [44]). The next important example is the UVRD/REP-type putative helicase. Helicases of this type are responsible for DNA unwinding and require ATP to function [45]. Interestingly, UVRD is already known to be a G4-binding protein in Escherichia coli and Neisseria gonorrhoeae. The next important protein predicted to be G4-binding was Fibrillarin 2, which is involved in rRNA processing and has methyltransferase activity [46]. Further interesting examples of putative G4-binding proteins are: the mediator of RNA polymerase transcription subunit 36a-like, which is involved in rRNA methylation; HEN2 with RNA helicase activity [42]; a helicase-like transcription factor with DNA helicase activity; Keratin Type II cytoskeletal 1-like protein; and translation initiation factor IF-2.
To further support our hypothesis regarding the occurrence of G4-binding proteins in plants sensu lato, we extended the BLASTp search to algae. The results are shown in Table 3. Apart from hypothetical protein sequences, we found many known proteins with G4-binding potential. These included: NER endonuclease; Protein EXPORTIN 1A (XPO1); DNA helicase; HEN2 protein; helicase-like transcription factor; La1 protein homolog (required for embryogenesis in Arabidopsis thaliana, which binds to and protects the 3′ poly(U) terminus of nascent RNA polymerase III transcripts [47]); DEAD-box ATP-dependent RNA helicase 10; FAD-dependent oxidoreductase; and the translation initiation factor. Our analysis also revealed that the G4-binding motif is present in the following interesting proteins that were not in the top 10 but are still significant: GRP2; GRP5; GRP7; GRP8; DNA topoisomerase IA; heat stress transcription factor B-2b (HSFB2B); RSZ22a (serine/arginine-rich splicing factor RSZ22A); glycine-rich RNA-binding protein 2, mitochondrial (GR-RBP2); THO complex subunit 4D (ALY4); DNA topoisomerase 3-alpha (TOP3A); and protein decapping 5 (DCP5). The complete BLASTp search results can be found in the Supplementary Materials (Files S5-S10). The analysis of the occurrence of the G4-binding motifs in 10 randomly selected plant proteins showed that these are present mainly in the N- and C-terminal regions (Figure 3). In two proteins (ARGONAUTE 3 from Arabidopsis thaliana, and mediator of RNA polymerase II transcription subunit 36a-like from Nicotiana tabacum), the main G4-binding motif was predicted to be located in the N-terminal region. In seven proteins, the main G4-binding motif was located at the C-terminal end (e.g., protein la 1 from Chlorella sorokiniana), and only one protein contained a putative G4-binding motif in its central region.
This is consistent with our previous analysis of the known human G4-binding proteins [23], although an evolutionary and mechanistic explanation of this phenomenon is still lacking. One explanation could be the modular nature of virtually all proteins, suggesting that the arginine-glycine-rich motifs were independently acquired during evolution to help organisms deal with G4s. Lastly, we inspected the number of putative G4-binding motifs in green plants and algae (Supplementary Material File S11). According to this analysis, the highest number of G4-binding motifs was present in Chara braunii, the model organism for plant terrestrialization (196 significant hits in total), followed by Zea mays (149 hits), Arabidopsis thaliana (147 hits), and Chlamydomonas reinhardtii (143 hits). In the vast majority of plants, fewer than 30 G4-binding motifs were found. As some plant genomes have been sequenced in more depth, and due to other biases (quality of genome/proteome annotation), it is difficult to say whether the differences observed in the total counts of G4-binding motifs have any physiological relevance.
Figure 3. Location of the G4-binding motifs (light green boxes) in 10 randomly selected plant proteins (thick black lines). The shortest protein sequences (THO complex subunit 4a; translation initiation factor IF-2-like; mediator of RNA polymerase II transcription subunit 36a-like) are only approximately 300 aa residues long. The longest protein sequence (helicase-like transcription factor) is over 1700 aa residues long. Except for protein HEN2, all putative G4-binding motifs are located near the N- or C-terminus of the protein.
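The positional analysis summarized in Figure 3 can be expressed as a simple rule. The sketch below classifies a motif by where its midpoint falls along the protein. The one-third cut-off for calling a region terminal, the function name, and the coordinates are assumptions made for illustration only; the criterion actually used for Figure 3 is not specified in the text.

```python
from collections import Counter


def motif_region(protein_length, motif_start, motif_end, terminal_fraction=1 / 3):
    """Classify a motif as 'N-terminal', 'central', or 'C-terminal'.

    Coordinates are 1-based and inclusive. terminal_fraction is an assumed
    cut-off: the motif midpoint must fall within this fraction of the protein
    length from either end to be called terminal.
    """
    if not 1 <= motif_start <= motif_end <= protein_length:
        raise ValueError("motif coordinates must lie within the protein")
    relative_midpoint = (motif_start + motif_end) / 2 / protein_length
    if relative_midpoint <= terminal_fraction:
        return "N-terminal"
    if relative_midpoint >= 1 - terminal_fraction:
        return "C-terminal"
    return "central"


# Invented coordinates: (protein length, motif start, motif end).
examples = [(300, 5, 30), (1700, 1650, 1675), (900, 440, 465)]
tally = Counter(motif_region(*coords) for coords in examples)
print(dict(tally))
```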

Discussion
When searching for G-quadruplex binding proteins, we found many hypothetical proteins and proteins with unknown functions, but also some relatively well-characterized proteins. One well-characterized protein was the Alba DNA/RNA binding protein, which is involved in, for example, genome packaging and organization, and RNA metabolism [48]. Additional examples were Apoptosis inhibitory protein 5 (API5), which binds RNA/mRNA, and is involved in programmed cell death regulation [49]; and protein Argonaute 3 (AGO3), which binds nucleic acids, and provides RNA silencing and defense responses to viral infections [50]. There were also many proteins belonging to the glycine-arginine-rich family, such as GRP2, GRP5, GRP7, and GRP8. These multifunctional proteins are regulated by abiotic stresses [51,52], and are involved in the early development of Arabidopsis thaliana [53]. As these proteins are known to promote DNA melting, they could also bind and help to resolve/unfold plant G4s.
One of our predicted G4-binding proteins was KAKU4. It is known that KAKU4 plays a role in the modulation of nuclear morphology in Arabidopsis thaliana, and it is likely that KAKU4 acts directly as a component of the nuclear lamina-like structure in seed-bearing plants [36]. Our results suggest that KAKU4 could also bind G4 structures and maintain the organization of nucleic acids via these interactions. In the future, it would be beneficial to inspect the G4-binding potential of KAKU4 in vitro, using standard molecular biology techniques such as the electrophoretic mobility shift assay (EMSA), circular dichroism spectroscopy, and other approaches.
Interestingly, some of our predicted proteins (or their close homologs) are already known to bind G4s. The DExH-box-dependent RNA helicase DExH1 from Arabidopsis thaliana shares significant sequence homology with the ATP-dependent DNA/RNA helicase DHX36 isoform 1 (also known as RHAU). There is also substantial evidence that human DHX36 binds and unwinds G4s via a non-processive, local strand-unwinding mechanism [54,55]. Another example is the nucleolin 1 protein (NUCL1). The human protein nucleolin is known to stabilize the G4 structures formed in the LTR promoter and to silence HIV-1 viral transcription [56]. In future studies, the evolution of glycine-arginine-rich domains/regions could be analyzed. For instance, one of our predicted G4-binding proteins was fibrillarin 2, and this protein is highly evolutionarily conserved. However, in the course of evolution from archaea to eukaryotes, it acquired an additional N-terminal glycine- and arginine-rich domain [44]. This would suggest that the fibrillarin homologs in archaea lack G4-binding potential.
G4s in DNA must be fine-tuned and balanced through dynamic G4 stabilization and resolution [57], ensuring the correct replication of DNA. During transcription, G4s also need to be specifically and spatiotemporally induced or resolved, allowing cells to use an additional regulatory layer to control gene expression [58] (Figure 4A,B). In addition, it seems that many of the predicted G4-binding proteins participate in RNA processing pathways, including mRNA translation, splicing, small RNA processing and modification, pre-rRNA/rRNA processing and modification, and RNA silencing (Figure 4C). The translation repressor protein decapping 3α seems to be a good candidate for facilitating the formation and stabilization of RNA G4s.
Figure 4. (C) Proteins containing a G4-binding motif in the context of cellular processes. (D) Predicted G4-binding proteins are involved in many previously described biological functions, comprising temperature, drought, and salt stress responses, flowering regulation, and gametogenesis.
Overall, we found G4-binding motifs in proteins of both green plants and algae. Most of these proteins have previously described molecular functions, and we organized them into groups by their potential ability to form/induce G4 structures (KAKU4), resolve G4s (DNA topoisomerase 3α), or drive biological processes (replication, transcription, RNA processing, or translation), as depicted in Figure 4. It is, therefore, likely that G4-protein interactions are also of great physiological relevance in plants, as suggested in Figure 4D.
Based on our data, we hypothesize that G4-protein interactions play a significant role in molecular processes such as apoptosis, RNA processing, silencing, and siRNA generation. However, these hypotheses should be confirmed by direct detection of G4-protein interactions, for example using pull-down assays, as described in [59]. Nonetheless, such studies are limited to in vitro experiments, and the high-throughput detection of all G4-protein interactions in living cells is still nearly impossible.

Conclusions
We predicted more than 400 candidate proteins in green plants (including the model organism, Arabidopsis thaliana) and algae with the potential to interact with G-quadruplexes, using a combination of two in silico approaches (BLASTp and FIMO). Interestingly, these proteins form highly statistically significant functional networks, suggesting their involvement in key molecular and physiological processes, which is supported by their annotation. Some of the G4-interacting proteins are common to green plants and algae (for example, DNA/RNA-binding proteins), which strengthens our hypothesis of a shared G4-centric regulation mediated by G4-binding proteins. In addition, we carried out representative molecular docking, which indicates that these proteins are able to interact with G4s via their RGG-rich regions and that their conformational geometry enables such interactions. Further studies are needed to evaluate these predicted G4-binding proteins in vitro, followed by in vivo functional studies.