Deciphering RNA-Recognition Patterns of Intrinsically Disordered Proteins

Intrinsically disordered regions (IDRs) and proteins (IDPs) are highly flexible owing to their lack of well-defined structures. A subset of such proteins interacts with various substrates, including RNA, frequently adopting regular structures in the final complex. In this work, we have analysed a dataset of protein–RNA complexes undergoing disorder-to-order transition (DOT) upon binding. We found that DOT regions in RNA-binding proteins are generally few and short (most complexes have fewer than three DOT regions, and most regions span three to 10 residues). As in structured proteins, positively charged residues are found to interact with RNA molecules, indicating the dominance of electrostatic and cation-π interactions. However, a comparison of binding frequencies shows that hydrophobic and aromatic interface residues interact more frequently in DOT regions than in the protein as a whole. Further, DOT regions have significantly higher exposure to water than their structured counterparts. Interactions of DOT regions with RNA increase sheet formation, with minor changes in helix-forming residues. We have computed the interaction energy for amino acid–nucleotide pairs, which showed a preference for His–G, Asn–U and Ser–U at the interface of DOT regions. This study provides insights into protein–RNA interactions, and the results could also be used to develop a tool for identifying DOT regions in RNA-binding proteins.


Introduction
Intrinsically disordered proteins lack stable three-dimensional structures under physiological conditions and are known to perform important roles in several processes including signalling, enzymatic activity, and gene regulation [1,2]. To perform these functions, disordered regions interact with protein, RNA, DNA, and other small molecules to gain ordered structures [3,4]. Experimentally, interactions mediated by IDRs can be observed using NMR and X-ray crystallography. However, because of poor resolution, problems in crystallization, and high time and resource consumption, computational methods are necessary to identify disorder-mediated interactions [5,6].
Several methods have been developed for understanding protein disorder using sequence or structural information [7][8][9][10]. In addition, disorder-to-order transitions in protein-protein interactions (PPI) are well studied experimentally and computationally [8][9][10][11][12]. For example, LMO4, a putative breast oncoprotein, interacts with various tandem LIM-domain-containing proteins through disordered regions [13]. BRCA1, a tumour suppressor protein, binds multiple protein and DNA partners through its central disordered region of ~1500 amino acids [14]. Recently, Papadakos et al. [15] showed that inducing intrinsic disorder in high-affinity protein-protein interactions reduces the binding affinity.

Results and Discussion
Our dataset contains a total of 23,452 and 2412 residues in non-ribosomal and ribosomal protein-RNA complexes, respectively. Among them, 1175 (5%) and 155 (6.4%) residues are found to be in DOT regions in non-ribosomal and ribosomal complexes, respectively. The residues binding with RNA are identified using 3.5 Å and 6 Å distance cut-offs, and both cut-offs give similar trends. Therefore, we present the results for 3.5 Å here, and those for 6 Å are shown in supplementary material.

Number of DOT Regions in Protein-RNA Complexes and Length of DOT Regions
The variation in the number of DOT regions in non-ribosomal and ribosomal complexes is shown in Figure 1. We observed that most of the complexes have fewer than three DOT regions (88% in non-ribosomal and 100% in ribosomal complexes). Most non-ribosomal proteins have one DOT region, whereas ribosomal proteins mostly have two or more DOT regions. In addition, at most eight DOT regions per complex are found in our dataset. Further, we analysed the length of each DOT region in non-ribosomal and ribosomal protein-RNA complexes, which shows that most DOT regions are short (Figure 2). In both non-ribosomal and ribosomal complexes, more than 70% of DOT regions have three to 10 residues and very few (only 5) regions are longer than 50 residues. This suggests that only a small conformational change might be required to achieve shape complementarity in protein-RNA complexes, and that these small DOT regions help to achieve it.

Binding Frequency of Residues at DOT Regions
The binding frequencies of residues in DOT regions using 3.5 Å (NR3.5 and RB3.5) and 6 Å (NR6 and RB6) distance cut-offs are shown in Figures 3 and S2, respectively. We observed that among the positively charged residues (Arg, Lys and His), Arg and Lys have a high preference for binding in both the NR3.5 (Figure 3a) and NR6 (Figure S2a) datasets. Interestingly, only eight and 13 of the 20 residue types are observed to bind in DOT regions in the RB3.5 (Figure 3b) and RB6 (Figure S2b) datasets, respectively, and Arg has the highest binding frequency. Cys, Met, and Trp in DOT regions are not involved in binding with RNA, whereas in ordered complexes 0.97%, 4.52%, and 5.54% of Cys, Met, and Trp are involved in binding, respectively. The comparison of binding site residues in DOT regions and the whole protein gave an expected presence of 1.5% and 2.7% for Met and Trp, respectively, at the interface of DOT regions. These results show that the non-occurrence of Met and Trp at the interface of DOT regions is statistically significant.

We have computed the preference of residues for binding in DOT regions by dividing the number of binding residues in DOT regions by the total number of binding residues, and the results are presented in Figures 4 and S3 for the 3.5 Å (NR3.5 and RB3.5) and 6 Å (NR6 and RB6) datasets, respectively. In Figure 4a, a high frequency of Arg, Gly, Lys, and Ser (z-score > 1) is observed for the NR3.5 dataset, which suggests that, among all residues in contact with RNA, these residues are more likely to lie in DOT regions. However, for the NR6 dataset (Figure S3a), the result is consistent only for Lys, and two other residues (Glu and Pro) show high binding frequency. In ribosomal protein complexes, Ala and Glu have high frequencies at 3.5 Å, and Glu and Tyr at 6 Å (Figures 4b and S3b).

Binding Propensity of Residues at DOT Region
Propensity is calculated by normalizing the binding frequency of residues in DOT regions by the overall frequency of the respective residues in the protein, using Equation (3). This measures the bias in binding of residues in DOT regions, independent of their count in DOT regions. We have calculated the propensity of amino acids to bind in DOT regions using distance cut-offs of 3.5 Å and 6 Å, and the results are shown in Figures 5 and S4, respectively. In the NR3.5 dataset (Figure 5a), His, Arg, Asn, Gln, Phe, and Tyr have high binding propensity, whereas in ribosomal proteins (RB3.5 dataset; Figure 5b), only His shows a high propensity. In the NR6 dataset (Figure S4a), His has high propensity, whereas Asn, His and Tyr have high propensity in the RB6 dataset (Figure S4b). Similarly, high binding propensity has been observed for positively charged residues along with Tyr and Phe in protein-RNA complexes [34]. On the other hand, among all charged residues only Arg has a high tendency to bind in DOT regions of protein-protein complexes [35]. Furthermore, non-specific interactions occur frequently in protein-protein complexes, which is not a common trend among the binding residues of DOT regions in protein-RNA complexes. Therefore, we can infer that the preferred residues at DOT regions are specific in protein-RNA complexes and, especially, charged interactions are important in DOT regions for binding with RNA.

Comparison of Frequency of Binding in the DOT Region and Other Residues of a Protein
To estimate the difference between binding in DOT regions and in other parts of proteins, we calculated the binding frequency of amino acids in these regions, as shown in Figures 6 and S5. Amino acids differ significantly in their binding with RNA in DOT regions and in the complete protein (p-value for the mean is less than 0.01). In non-ribosomal proteins, when the 3.5 Å cut-off is considered, nonpolar and aromatic residues mostly have higher frequency values in the DOT regions than in the overall protein. All the frequencies are found to be significant when statistical analysis is performed on the bootstrapped sample of the frequencies (p-value less than 0.01). Residues such as His, Phe, and Leu show a more than 3-fold increase in binding frequency in the DOT regions relative to other parts of the proteins. A similar trend is observed in the NR6 dataset (Figure S5).

Amino Acid Contact Frequency with Nucleotides
We have also analysed amino acid contacts with each nucleotide in non-ribosomal complexes using 3.5 Å and 6 Å distance cut-offs for contacting residues, and the results are shown in Figures 7 and S6. With the 3.5 Å distance criterion, Arg and Lys have a high frequency of binding with nucleotides: Arg has the highest binding frequency with Guanine, and Lys the lowest with Uracil. With the 6 Å criterion, almost the same frequency of binding is observed for Arg and Lys with Adenine, Guanine and Cytosine, and the least binding is observed with Uracil. By comparison, in our earlier work on ordered protein-DNA and protein-RNA complexes, Arg, Lys, Trp, and Tyr were favoured in RNA binding and Arg in DNA binding, with Guanine preferred in DNA complexes and Uracil in RNA complexes [36].

Figure 6. Binding frequency of each amino acid in the DOT region and in the overall protein for non-ribosomal complexes using the 3.5 Å cut-off.

Secondary Structure of DOT and RNA-Interacting DOT Residues
The secondary structures of DOT residues are quantified to study the bias of residues towards a specific secondary structure in binding and non-binding regions, and the data are presented in Tables 1 and S1 for the NR3.5 and NR6 datasets, respectively. In the NR3.5 dataset, all the DOT residues have lower and higher preference in the sheet (15%) and other structure class (8%), respectively. Interestingly, among DOT residues binding with RNA molecules, strand-forming residues have a higher preference (15.2%) than helical (8.6%) and other regions (8.9%).
Table 1 note: percentages are given in parentheses. Relative binding in DOT regions is calculated as $N_{idt}/N_{d} \times 100$.

Relative Solvent Accessibility of DOT Residues
The spatial arrangement of DOT residues is further explored by solvent accessibility calculation and the result is shown in Table 2. Comparison of the RASA of DOT regions and of the complete protein-RNA complex revealed that every amino acid type is more solvent accessible in DOT regions than in the protein as a whole. As expected, charged residues show a low fold difference (1.18 to 1.28) in RASA between DOT regions and the complete protein. However, most hydrophobic residues (Ala, Cys, Ile, Leu, Met, Phe, Tyr, and Val) have about 1.8- to 2-fold higher RASA in DOT regions than in the complete protein, with Met showing the largest difference. On the other hand, the mean solvent accessibility of DOT regions of proteins is 44 Å², which is similar to the average RASA of binding DOT regions (43 Å²) of protein-protein complexes [17].

Number of Residues in Contact with Nucleotides in the DOT Region and in Entire Protein
Among 1175 residues in DOT regions in our dataset, only 96 (8.17%) and 268 (22.81%) are in contact with nucleotides in the NR3.5 and NR6 datasets, respectively. Almost all the residues have a similar tendency of binding with nucleotides in proteins, ranging between 20% and 29%, as shown in Table 3 and Table S2. However, the number of nucleotides interacting with DOT residues is somewhat different, that is, the range of interaction is 18% to 33%. The DOT residues are more likely to bind with Guanine (20.4%), followed by Cytosine and Uracil, than with Adenine (13.1%).

Secondary Structure of Nucleotides Interacting with DOT Residues
Further, we have classified the nucleotides based on location and contacts with DOT residues and preference of amino acids in a protein and the results are presented in Table 4 and Table S3. Among all secondary structures formed by nucleotides, unpaired bases are most likely to bind with DOT residues. Specifically, we observed that A and U in unpaired regions prefer to interact with DOT residues, whereas C and G in unpaired and base-paired positions interact with DOT residues with a similar preference. G and C also interact with DOT residues in pseudoknot secondary structure, whereas A and U are least likely to exist in pseudoknot form when bound to DOT regions.

Interaction Energy of DOT Residues with Nucleotides
We have computed the interaction energy between amino acids and nucleotides in DOT and ordered regions at the binding interface and the results are presented in Table 5. Most of the amino acids have stronger interactions with nucleotides in ordered regions than in DOT regions. However, we noticed that some amino acid-nucleotide pairs have favourable energy when interacting in DOT regions: Arg, His, Ile, Leu, Val, and Phe with G; His, Ser, and Val with C; and Asn, Asp, Gly, Ile, Leu, and Ser with U. In addition, the hydrophobic residues Ile, Leu, and Val have more favourable interactions with G at DOT regions than others. Since Arg and Lys are important for protein-RNA complex formation through electrostatic interactions, these residues have stronger energies in ordered regions than in DOT regions. On the other hand, His in the DOT region has favourable energy with G and C. These differences in energy could be important for understanding the interactions between DOT regions and the RNA molecule, and might also be used to distinguish the RNA-binding residues of proteins in DOT and other regions.
Table 5. Interaction energy between amino acids and nucleotides in DOT regions.
We have compared the interaction energy of amino acid-nucleotide pairs at the interface of DOT and other regions, and two typical examples are shown in Figure 8. We noticed a wide range of interactions, such as stacking, cation-π, electrostatic, and van der Waals interactions, at the interface. The most favourable energies in DOT regions are observed for Asn with U (−3.26 kcal/mol) and His with G (−5.44 kcal/mol) (Figure 8a). On the other hand, Arg and Phe have favourable energies with A (−8.49 kcal/mol) and C (−4.88 kcal/mol), respectively, in non-DOT regions.

Materials and Methods
We adopted the following protocol to obtain a set of protein-RNA complexes with disorder-to-order transition (DOT) regions: (i) Downloaded the protein-RNA complexes from the PDB and NDB databases (www.rcsb.org) [37][38][39]; (ii) Clustered all the protein-RNA complexes with a 30% sequence identity cut-off using the CD-Hit suite [40]; (iii) Performed a BLAST search (using a 99% identity cut-off) of protein sequences to obtain free proteins corresponding to each protein-RNA complex [41,42]. The free proteins have the same sequences as the protein part of the protein-RNA complexes but were crystallized without RNA. Note that free proteins have their own PDB IDs, which are distinct from those of the protein-RNA complexes; (iv) Obtained disordered residues from the missing-residue information in each protein-RNA complex and free protein pair by locating the "REMARK 465" statement in the protein structure file; (v) Isolated DOT residues by comparing the disordered residues of each free protein and protein-RNA complex pair, such that a residue is ordered in the protein-RNA complex but disordered in the free protein. Note that only regions having 3 or more continuous DOT residues are considered. The final dataset contains 101 DOT regions in 52 proteins and the complete data are given in supplementary information. The representation of DOT and ordered regions in a typical protein-RNA complex (PDB ID: 4H4K) is shown in Figure 9.
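Steps (iv) and (v) of this protocol can be sketched in a few lines of Python. This is a minimal illustration and not the scripts used in this work: the function names `missing_residues` and `dot_residues` are hypothetical, the parser reads the whitespace-separated residue table that follows the REMARK 465 column header, and it assumes that the free protein and the complex share the same chain identifiers and residue numbering. Treating a residue as ordered in the complex whenever it is absent from the complex's REMARK 465 list is also a simplifying assumption.

```python
import re

def missing_residues(pdb_text):
    """Return the set of (chain, residue number) pairs listed under REMARK 465."""
    missing = set()
    in_table = False
    for line in pdb_text.splitlines():
        if not line.startswith("REMARK 465"):
            continue
        fields = line.split()[2:]
        if fields[:3] == ["M", "RES", "C"]:
            # Column header line; residue records follow it.
            in_table = True
            continue
        if not in_table or len(fields) < 3:
            continue
        # The last three fields are residue name, chain and sequence number
        # (an optional model number may precede them).
        _, chain, seq = fields[-3:]
        match = re.match(r"-?\d+", seq)  # drop a trailing insertion code, if any
        if match:
            missing.add((chain, int(match.group())))
    return missing

def dot_residues(free_pdb_text, complex_pdb_text):
    """Residues missing (disordered) in the free protein but not in the complex."""
    disordered_free = missing_residues(free_pdb_text)
    disordered_complex = missing_residues(complex_pdb_text)
    return sorted(disordered_free - disordered_complex)
```

In practice the two structures would first need to be aligned so that residue numbers correspond, which this sketch does not attempt.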

Figure 9. The disorder-to-order transition (DOT) region can be clearly seen in green, with the overlapping region missing in the free protein.

Number of DOT Regions and Their Lengths
The number of DOT regions is obtained by counting the separate (non-consecutive) runs of DOT residues, and the length of each region by counting the consecutive residues within a run, using custom-built Python scripts.
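The counting described above can be illustrated with a short sketch. The function name `dot_regions` and its `min_length` parameter are hypothetical; the default of 3 follows the requirement that only regions with 3 or more continuous DOT residues are considered. The input is assumed to be the DOT residue numbers of a single chain.

```python
from itertools import groupby

def dot_regions(residue_numbers, min_length=3):
    """Group residue numbers into runs of consecutive residues.

    Returns a list of (first residue, last residue, length) tuples for
    runs that contain at least `min_length` residues.
    """
    ordered = sorted(set(residue_numbers))
    regions = []
    # Consecutive numbers share the same (value - position) offset.
    for _, run in groupby(enumerate(ordered), key=lambda pair: pair[1] - pair[0]):
        members = [number for _, number in run]
        if len(members) >= min_length:
            regions.append((members[0], members[-1], len(members)))
    return regions
```

The number of DOT regions in a chain is then `len(dot_regions(numbers))`, and the third element of each tuple gives the region length.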

DOT Residues in Contact with RNA
The residues in contact with RNA molecules are obtained by using distance cut-offs mentioned in literature, that is, 3.5 Å and 6 Å [43][44][45]. Binding residues in DOT regions are obtained by taking common residues in the DOT dataset and RNA contacting residues. We have classified protein-RNA complexes in non-ribosomal and ribosomal classes because of the difference in their interaction pattern, number of interacting amino acids, and residue bias in them [46]. Therefore, using the type of complex and distance cut-off for interacting residues, we divided protein-RNA complexes into four different datasets: (1) NR3.5: non-ribosomal complex with a contact distance of 3.5 Å; (2) RB3.5: ribosomal complex with a contact distance of 3.5 Å; (3) NR6: non-ribosomal complex with a contact distance of 6 Å; and (4) RB6: ribosomal complex with a contact distance of 6 Å.
We computed the frequency of each DOT residue involved in binding using Equation (1):
$$\text{Frequency of binding residues in DOT region} = \frac{N_{ib}}{N_{id}} \tag{1}$$
where $N_{ib}$ is the number of ith residues binding in the DOT region and $N_{id}$ is the number of ith residues in DOT regions. Moreover, the differences in the frequency of binding residues in DOT regions and in the protein complexes are obtained.
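As an illustration, the contact definition and Equation (1) might be implemented as follows. This is a sketch under stated assumptions: a residue is taken to be in contact when any of its atoms lies within the cut-off of any RNA atom (the text does not specify which atoms are compared), and the function names and input layout are hypothetical.

```python
import math
from collections import Counter

def contacting_residues(protein_atoms, rna_atoms, cutoff=3.5):
    """Return ids of residues with at least one atom within `cutoff` Å of RNA.

    protein_atoms: iterable of (residue_id, (x, y, z)) tuples.
    rna_atoms: iterable of (x, y, z) tuples.
    """
    rna_atoms = list(rna_atoms)
    contacts = set()
    for residue_id, xyz in protein_atoms:
        if residue_id in contacts:
            continue  # residue already known to be in contact
        if any(math.dist(xyz, rna_xyz) <= cutoff for rna_xyz in rna_atoms):
            contacts.add(residue_id)
    return contacts

def binding_frequency(dot_residue_names, contacts):
    """Equation (1): N_ib / N_id for each residue type found in DOT regions.

    dot_residue_names: dict mapping residue_id -> residue name for DOT residues.
    contacts: set of residue ids in contact with RNA.
    """
    n_id = Counter(dot_residue_names.values())
    n_ib = Counter(name for rid, name in dot_residue_names.items() if rid in contacts)
    return {name: n_ib[name] / n_id[name] for name in n_id}
```

Running the same functions with `cutoff=6.0` gives the corresponding values for the 6 Å datasets.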

Frequency of Binding in DOT and Other Residues
We also computed the frequency of residues binding in DOT regions over all the binding residues by using Equation (2). Error bars are plotted using the bootstrap method by randomly re-sampling equal-sized data with replacement 1000 times.
$$\text{Frequency of binding by contact residues} = \frac{N_{ibd}}{N_{ib}} \tag{2}$$
where $N_{ibd}$ is the number of ith residues binding in the DOT region and $N_{ib}$ is the number of ith residues binding with RNA in the complete protein.

Propensity of Binding Residues in DOT Region
Normalizing the frequency of residues binding in DOT regions by the individual residue frequency gives the tendency of a residue to bind in DOT regions. Accordingly, propensity values are calculated using the following equation:
$$\text{Propensity}(i) = \frac{N_{ibd}/N_{id}}{N_{ip}/N_{p}} \tag{3}$$
where Propensity(i) is the propensity of the ith residue; $N_{ibd}$ is the number of ith residues binding in DOT regions; $N_{id}$ is the number of ith residues in DOT regions; $N_{ip}$ is the number of ith residues in the protein; and $N_{p}$ is the number of residues in the protein.
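A small sketch of this calculation is given below. It assumes that the propensity is the DOT binding frequency, $N_{ibd}/N_{id}$, divided by the overall frequency of the residue in the protein, $N_{ip}/N_{p}$, which is how the normalization is described in the text; the function name and the handling of zero counts are choices made for this example only.

```python
def propensity(n_ibd, n_id, n_ip, n_p):
    """Propensity of a residue type to bind in DOT regions.

    n_ibd: number of residues of this type binding in DOT regions
    n_id:  number of residues of this type in DOT regions
    n_ip:  number of residues of this type in the protein
    n_p:   total number of residues in the protein
    Returns None when the residue type is absent (undefined ratio).
    """
    if n_id == 0 or n_ip == 0 or n_p == 0:
        return None
    binding_frequency = n_ibd / n_id   # frequency of binding within DOT regions
    overall_frequency = n_ip / n_p     # background frequency of the residue type
    return binding_frequency / overall_frequency
```

For example, a residue type with 4 binding residues out of 20 in DOT regions, and 50 occurrences among 1000 residues overall, has a binding frequency of 0.2 against a background of 0.05, so values above 1 indicate enrichment relative to the residue's abundance.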

Bootstrap Sampling
To obtain the standard error in frequency and propensity calculations, bootstrap sampling is performed. In this technique, the protein-RNA complexes are sampled randomly with replacement, and each sample contains as many complexes as the original set of protein-RNA complexes. Therefore, each sample will contain some complexes more than once and will lack others. In this manner, we have created 1000 samples on which the calculations are performed.
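The resampling scheme can be sketched as follows. The function names, the per-complex data layout and the fixed random seed are assumptions of this example; the statistic shown is a pooled binding frequency, but any frequency or propensity calculation could be passed in its place.

```python
import random
import statistics

def bootstrap_standard_error(complexes, statistic, n_samples=1000, seed=0):
    """Standard error of `statistic` from resampling complexes with replacement.

    complexes: list of per-complex records (any type accepted by `statistic`).
    statistic: function mapping a list of records to a single number.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_samples):
        # Each sample has as many complexes as the original set.
        sample = rng.choices(complexes, k=len(complexes))
        estimates.append(statistic(sample))
    return statistics.stdev(estimates)

def pooled_frequency(sample):
    """Example statistic: binding residues over DOT residues, pooled over complexes.

    Each record is a (number of binding DOT residues, number of DOT residues) pair.
    """
    return sum(b for b, _ in sample) / sum(d for _, d in sample)
```

The standard deviation of the 1000 resampled estimates is used here as the standard error, which is the usual bootstrap convention.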

Relative Average Solvent Accessibility (RASA)
The buriedness of DOT residues is analysed with the NACCESS [47] program and the RASA is calculated by using Equation (4):
$$\text{RASA} = \frac{1}{n}\sum_{i=1}^{n} A_{ibd} \tag{4}$$
where $A_{ibd}$ is the RASA of the ith residue binding with RNA in the DOT region and n is the number of DOT residues in a protein-RNA complex.

Secondary Structure of Protein and RNA
Secondary structures of proteins and RNA molecules are analysed by the DSSP and DSSR programs, respectively [48,49]. The DSSR program gives the dot-bracket notation of the secondary structure of RNA as shown in Figure S1, in which "." represents an unpaired nucleotide, "(" or ")" represent paired bases, and "{" or "}" or "[" or "]" or "<" or ">" represent pseudoknot bases.
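The mapping from dot-bracket symbols to the three nucleotide classes used here can be written down directly. The function name and the "other" label for unrecognized symbols are choices made for this sketch.

```python
def classify_dot_bracket(structure):
    """Label each position of a dot-bracket string.

    "."            -> "unpaired"
    "(" or ")"     -> "paired"
    "[]", "{}", "<>" -> "pseudoknot"
    Any other symbol is labelled "other".
    """
    labels = []
    for symbol in structure:
        if symbol == ".":
            labels.append("unpaired")
        elif symbol in "()":
            labels.append("paired")
        elif symbol in "[]{}<>":
            labels.append("pseudoknot")
        else:
            labels.append("other")
    return labels
```

Combining these labels with the list of nucleotides in contact with DOT residues gives the counts of unpaired, paired and pseudoknot nucleotides at the interface.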

Binding Preference of Nucleotides for Amino Acids
The binding preference of nucleotide with DOT residues has been calculated by counting the occurrence of nucleotides-amino acid interacting pairs under the distance of 3.5 Å.
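A minimal sketch of this counting is shown below, assuming the interacting pairs have already been identified with the 3.5 Å criterion and are supplied as (amino acid, nucleotide) tuples. The function names are hypothetical.

```python
from collections import Counter

def nucleotide_pair_counts(contacts):
    """Count occurrences of each (amino acid, nucleotide) interacting pair.

    contacts: iterable of (amino_acid, nucleotide) tuples, one per contact.
    """
    return Counter(contacts)

def nucleotide_share(pair_counts):
    """Percentage of all counted contacts made with each nucleotide."""
    per_nucleotide = Counter()
    for (_, nucleotide), count in pair_counts.items():
        per_nucleotide[nucleotide] += count
    total = sum(per_nucleotide.values())
    return {nt: 100.0 * n / total for nt, n in per_nucleotide.items()}
```

Whether a residue that touches the same nucleotide through several atoms is counted once or several times is decided when the contact list is built, not by these functions.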

Interaction Energy between Amino Acids and Nucleotides at Binding Interface
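The interaction energy of amino acids with nucleotides is computed as the sum of van der Waals and Coulomb potentials using the AMBER force field [50]. It is given by
$$E_{\mathrm{inter}} = \sum_{i,j}\left[\frac{A_{ij}}{R_{ij}^{12}} - \frac{B_{ij}}{R_{ij}^{6}} + \frac{q_i q_j}{R_{ij}}\right]$$
where $A_{ij} = \varepsilon_{ij}^{*}(R_{ij}^{*})^{12}$ and $B_{ij} = 2\varepsilon_{ij}^{*}(R_{ij}^{*})^{6}$; $R_{ij}^{*} = R_{i}^{*} + R_{j}^{*}$ and $\varepsilon_{ij}^{*} = (\varepsilon_{i}^{*}\varepsilon_{j}^{*})^{1/2}$; $R^{*}$ and $\varepsilon^{*}$ are the van der Waals radius and well depth, respectively, and these parameters are obtained from Gromiha et al. [51]; $q_i$ and $q_j$ are the charges on atoms i and j, respectively, and $R_{ij}$ is the distance separating atoms i and j.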
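The energy terms can be illustrated with the following sketch, which assumes the standard 12-6 Lennard-Jones plus Coulomb form implied by the definitions of $A_{ij}$ and $B_{ij}$. The function name, the atom tuple layout and the `coulomb_constant` parameter are assumptions of this example: the constant (about 332.06 kcal Å mol⁻¹ e⁻²) converts charges given in elementary units to kcal/mol and is not stated in the text, and no dielectric screening is applied.

```python
import math

def interaction_energy(atoms_a, atoms_b, coulomb_constant=332.0636):
    """Sum of van der Waals and Coulomb energies between two groups of atoms.

    Each atom is a tuple (xyz, radius, well_depth, charge), where radius is R*
    in Å, well_depth is epsilon* in kcal/mol and charge is in elementary units.
    """
    energy = 0.0
    for xyz_i, radius_i, depth_i, charge_i in atoms_a:
        for xyz_j, radius_j, depth_j, charge_j in atoms_b:
            r = math.dist(xyz_i, xyz_j)
            r_star = radius_i + radius_j            # R*_ij = R*_i + R*_j
            eps = math.sqrt(depth_i * depth_j)      # eps*_ij = sqrt(eps*_i eps*_j)
            a_ij = eps * r_star ** 12
            b_ij = 2.0 * eps * r_star ** 6
            van_der_waals = a_ij / r ** 12 - b_ij / r ** 6
            coulomb = coulomb_constant * charge_i * charge_j / r
            energy += van_der_waals + coulomb
    return energy
```

With this parameterization the van der Waals term reaches its minimum, equal to minus the well depth, when two atoms are separated by exactly $R_{ij}^{*}$, which gives a quick sanity check on the inputs.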

Conclusions
The analysis of DOT regions in protein-RNA complexes revealed that in each complex these regions are generally small in size. Electrostatic interactions are found to be important, with the involvement of positively charged residues (Arg, Lys and His) in DOT regions. Among nucleotide-amino acid pairs, guanine-Arg and uracil-Lys are identified as the most and the least preferred pairs at the interface, respectively. Generally, nucleotides prefer to bind DOT regions rather than other regions of the protein.
Further, DOT regions are significantly more exposed to solvent than other residues of protein-RNA complexes. Specifically, hydrophobic residues show a larger difference in RASA between DOT regions and complete proteins. DOT regions prefer to form coils, turns, and bends rather than regular secondary structures such as helices and strands. On the RNA side, DOT residues prefer to bind unpaired A and U and paired regions of C and G. In pseudoknots, mostly C and G interact with DOT residues. The interaction energy calculations revealed the types of interactions and the preferred amino acid-nucleotide pairs at the interface based on energy.
The frequencies and propensities obtained in the present study could be used for discriminating DOT binding residues from other residues. Further, the location of DOT binding residues based on solvent accessibility and secondary structure of protein and RNA along with energy calculations may help to understand the recognition mechanism.
We obtained the DOT regions by comparing the missing residues (those lacking 3D coordinates) in protein-RNA complexes and their respective free proteins. This might under-represent DOT regions, since crystallization often stabilizes residues and reduces the native disorder; hence, disordered residues that have 3D coordinates in free proteins are not considered. The current study can be further refined with the availability of more protein-RNA complexes and improvements in structure determination techniques. In addition, the development of disorder-specific databases for protein-nucleic acid complexes with large datasets could enhance the confidence level of the results reported in the present study.