Computational Prediction of RNA-Binding Proteins and Binding Sites

Interactions between proteins and RNA play vital roles in many cellular processes, such as protein synthesis, sequence encoding, RNA transfer, and gene regulation at the transcriptional and post-transcriptional levels. Approximately 6%–8% of all proteins are RNA-binding proteins (RBPs). Distinguishing these RBPs or their binding residues is a major aim of structural biology. A number of experimental methods have been developed to determine protein–RNA interactions; however, they are expensive, time-consuming, and labor-intensive. As an alternative, researchers have developed many computational approaches to predict RBPs and protein–RNA binding sites by combining various machine learning methods with abundant sequence and/or structural features. These computational approaches fall into three kinds: prediction from protein sequence, prediction from protein structure, and protein–RNA docking. In this paper, we review existing studies on the prediction of RNA-binding sites, RBPs, and protein–RNA complexes, including the data sets used in different approaches, the sequence and structural features used in several predictors, prediction method classifications, performance comparisons, evaluation methods, and future directions.


Introduction
Approximately 6%-8% of proteins are RNA-binding proteins (RBPs), which play an important part in gene expression and regulation. Because of study limitations, only a few types of RBPs have been identified, such as HuR, AUF1, TTP, TIA1, and CUGBP2. These RBPs perform essential roles in various biological processes such as mRNA stability [1], stress responses [2], cell cycle, tumor differentiation [3], apoptosis, and gene regulation at the transcriptional and post-transcriptional levels [4]. Determining the three-dimensional (3D) structures of protein-RNA complexes facilitates the identification of their physicochemical properties and biological interactions.
Experimental methods typically used for protein-RNA complex structure determination, such as nuclear magnetic resonance (NMR) spectroscopy [5] and X-ray crystallography [6], are expensive, time-consuming, and labor-intensive. To date, 2274 experimentally determined protein-RNA complex structures have been deposited in the Protein Data Bank (PDB) [7], far fewer than the number of complexes that exist in nature. Given the large numbers of nucleic acid and protein sequences available, improved knowledge of how protein-RNA interactions occur could help us to infer functional information.
To achieve this goal, it is necessary to develop computational approaches that can reliably and rapidly identify RNA-binding proteins or sites. In contrast with experimental methods, computational tools can identify RNA-binding sites and RBPs inexpensively and quickly, which is helpful in studying protein-RNA interactions [8]; however, prediction based only on amino acid sequence information is difficult because organisms are highly complex. Several methods have been developed that focus on predicting RNA-binding sites and determining whether a protein-RNA complex exists. The majority of previous studies have focused on predicting RNA-binding sites and RBPs from sequence similarity [9][10][11][12]. The query protein sequence is searched against databases; if its homologous sequences are known RNA-binding proteins, the query protein is regarded as an RNA-binding protein. RNA-binding residues and sites in the query sequence can be detected in a similar way. In addition, methods based on predicted structural and sequence information are the most often used computational approaches to identify RNA-binding sites or RBPs. If the 3D structure of a target protein is known, structure-based prediction can be carried out to distinguish RBPs [13][14][15]. It is believed that structural similarity provides more reliable and in-depth prediction results. Another technique is docking, which starts from the coordinates of the components and aims at modelling the interaction conformation of macromolecular complexes [16]. Many protein-protein docking tools have been reported, but no specific RNA-protein docking method exists [17]. Several protein-protein docking programs, such as HADDOCK [18], GRAMM [19], HEX [20], PatchDock [21], and FTDock [22], accept RNA and protein coordinates as inputs to model protein-RNA complexes. The above strategies for RNA-binding site and RBP prediction are summarized in Figure 1.
Int. J. Mol. Sci. 2015, 16
Although the methodology for predicting protein-protein interactions and protein-DNA interactions is well established [23,24], analyses of computational approaches used to identify protein-RNA interactions are lacking [8,17]. In this review, we discuss computational approaches for predicting RBPs and RNA-binding sites based on protein sequences or known protein 3D structures. RNA-protein complex docking methods are also discussed. We summarize detailed information on these computational tools, including the various feature vectors based on sequence and/or structure, the datasets used by each algorithm, performance comparisons, and machine learning methods. In particular, we summarize the available web servers for RNA-binding site and RBP prediction, which are convenient for scientists. Finally, we discuss future directions and several implications that can aid method development.

Data Set
The sequences and structures of protein-RNA complexes are available from the PDB database and from specific protein-RNA interaction databases (Available online: http://pridb.gdcb.iastate.edu/) [25]. We analyzed several previous studies and summarize the datasets and methods used, which are listed in detail in Table 1. Of all existing datasets, RB344 is the largest and contains 344 non-redundant RBPs with no more than 30% sequence identity [26]. In several studies, authors employed the same dataset to compare the advantages and disadvantages of various methods. In particular, Cheng et al. [27] constructed a novel PRIPU dataset which differs from previous datasets. The PRIPU dataset contains positive and unlabeled samples but no negative samples, because samples labeled as negative are not necessarily genuine negatives and may even be unknown positives. RNA-binding residues are determined using two definitions: (i) a residue with any atom within a distance cutoff (3-6 Å) of any atom in a nucleotide; and (ii) a residue involved in hydrophobic, electrostatic, van der Waals, or hydrogen-bonding interactions with nucleotides [25]. Residues satisfying these definitions are considered to be RNA-binding residues. As with protein-DNA complexes and protein-protein complexes, similar sequences in protein-RNA interactions are eliminated before dataset construction.
Generally, sequences with similarities greater than 30%-40% are considered redundant. Clustering programs such as blastclust (available from NCBI), CD-HIT [34], and the PISCES web server are used to generate a non-redundant dataset.
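The distance-based definition of an RNA-binding residue given above can be illustrated with a short sketch. This is a minimal example under stated assumptions, not the procedure of any cited study: the function name `binding_residues`, the input layout (residue identifier plus coordinates), and the toy coordinates are all hypothetical, and the 3.5 Å default is just one value inside the 3-6 Å range mentioned in the text.

```python
import math


def binding_residues(protein_atoms, rna_atoms, cutoff=3.5):
    """Return sorted residue ids having any atom within `cutoff` angstroms
    of any RNA atom.

    protein_atoms: iterable of (residue_id, (x, y, z)) tuples (assumed layout)
    rna_atoms:     iterable of (x, y, z) tuples
    """
    rna_atoms = list(rna_atoms)
    hits = set()
    for residue_id, coord in protein_atoms:
        if residue_id in hits:
            continue  # this residue is already labelled as binding
        if any(math.dist(coord, r) <= cutoff for r in rna_atoms):
            hits.add(residue_id)
    return sorted(hits)


# Toy coordinates, invented purely for illustration.
protein = [(1, (0.0, 0.0, 0.0)), (1, (1.5, 0.0, 0.0)),
           (2, (10.0, 0.0, 0.0)), (3, (0.0, 6.0, 0.0))]
rna = [(3.0, 0.0, 0.0), (0.0, 11.0, 0.0)]

print(binding_residues(protein, rna, cutoff=3.5))
print(binding_residues(protein, rna, cutoff=6.0))
```

Because the label of a residue depends on the chosen cutoff, datasets built with different cutoffs are not directly comparable, which is one reason the cutoff is usually reported alongside the dataset.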

Feature Selection for RNA-Binding Residues and Protein Predictors
Many features have been used to identify RBPs and binding sites. They fall into three kinds: sequence-based features, structure-based features, and chemical and physical features. The commonly used features summarized here include amino acid composition, sequence similarity, evolutionary information, accessible surface area (ASA), predicted secondary structures (SSs), hydrophobicity, electrostatic patches, cleft sizes, and other global protein features. These features are detailed below.

Sequence-Based Features
Amino Acid Composition
Amino acid composition is one of the most commonly used protein sequence features, not only in protein-protein interaction site prediction but also in RNA-binding site prediction. The 20 amino acids can be grouped by their properties into hydrophobic residues (G, F, L, M, A, I, P, V), polar residues (Q, T, S, N, C, Y, W), and charged residues (H, R, K, E, D) [35]. One encoding method is based on these physicochemical properties of the residue types: hydrophobic, polar, and charged residues are encoded as (1 0), (0 1), and (0 0), respectively. In particular, the negatively charged RNA backbone is more likely to interact with positively charged residues, as shown in previous studies [36]. The other encoding method is standard binary encoding, which encodes each amino acid as a 20-dimensional binary vector, such as E (0 0

Sequence Similarity
Sequence similarity (also referred to as sequence conservation) is frequently used for RNA-binding site prediction. The BLAST and PSI-BLAST programs are used to compare the similarities among various protein sequences. Generally, a multiple sequence alignment (MSA) is obtained by comparing the query sequence against the NCBI non-redundant database and is then used to calculate a sequence similarity score for each residue. A number of conservation scores are available, including relative entropy, von Neumann entropy, Shannon entropy, and Scorecons.
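As an illustration of how one of the conservation scores named above can be computed, the following sketch derives a per-column Shannon entropy from an alignment and rescales it so that 1 means fully conserved. The function names, the gap handling (gaps are ignored), the normalization by log2(20), and the toy alignment are assumptions made for this example; published tools differ in these details.

```python
import math
from collections import Counter


def shannon_entropy(column):
    """Shannon entropy (bits) of the residues in one alignment column.

    Gap characters ('-') are ignored; returns None for an all-gap column.
    """
    residues = [c for c in column if c != "-"]
    if not residues:
        return None
    total = len(residues)
    counts = Counter(residues)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def conservation_scores(alignment):
    """Per-column conservation in [0, 1]; 1 = a single residue type.

    alignment: list of equal-length aligned sequences (strings).
    """
    max_entropy = math.log2(20)  # entropy of a uniform mix of 20 residue types
    scores = []
    for column in zip(*alignment):
        entropy = shannon_entropy(column)
        # An all-gap column carries no information; score it as 0.
        scores.append(0.0 if entropy is None else 1.0 - entropy / max_entropy)
    return scores


# Toy alignment, invented for illustration.
msa = ["KRKA", "KRRG", "KKRS", "KRK-"]
for position, score in enumerate(conservation_scores(msa), start=1):
    print(position, round(score, 3))
```

In this toy alignment the first column is invariant and therefore receives the highest score, while the last column, with three different residue types, receives the lowest.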

Evolutionary Information
Evolutionary information has often been introduced into functional site predictors in recent studies, including RNA-binding site predictors. Previous studies showed that the position-specific scoring matrix (PSSM), an important form of evolutionary information, greatly improved the performance of RBP prediction. PSSMs were widely used in previous prediction studies because they provide the likelihood of a particular residue substitution at each position based on evolutionary information.
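A PSSM is usually turned into classifier input by taking a sliding window of rows around each residue. The sketch below shows one common way to do this, under stated assumptions: the logistic scaling of raw scores, the zero padding at the sequence ends, the window size, and the names `scale` and `window_features` are illustrative choices and are not taken from any specific predictor discussed here. The toy matrix has only two columns to keep the example short; a real PSSM has 20 columns per residue.

```python
import math


def scale(value):
    """Map a raw PSSM score to (0, 1) with the logistic function."""
    return 1.0 / (1.0 + math.exp(-value))


def window_features(pssm, window=3):
    """Build one flat feature vector per residue from a sliding window.

    pssm:   list of rows (one per residue), each a list of scores
    window: odd window size; positions beyond the sequence ends are
            zero-padded (an assumption of this sketch)
    """
    if window % 2 == 0:
        raise ValueError("window must be odd")
    half = window // 2
    width = len(pssm[0])
    padding = [0.0] * width
    features = []
    for i in range(len(pssm)):
        vector = []
        for j in range(i - half, i + half + 1):
            if 0 <= j < len(pssm):
                vector.extend(scale(v) for v in pssm[j])
            else:
                vector.extend(padding)
        features.append(vector)
    return features


# Toy two-column "PSSM", invented for illustration.
toy_pssm = [[0, 2], [-1, 3], [4, -4]]
for vector in window_features(toy_pssm, window=3):
    print([round(v, 3) for v in vector])
```

Each residue is thus represented by `window * width` numbers, so a window of 11 over a 20-column PSSM would give 220 features per residue.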

Structure-Based Features
The Secondary Structure (SS)
The secondary structure (SS) provides local geometric patterns and can be obtained in two ways. When the protein structure is available, the actual SS can be calculated using an SS assignment approach such as DSSPcont [37,38]; when the protein structure is unavailable, a predicted SS can be obtained using an SS prediction algorithm such as PSIPRED [38][39][40]. SS has been employed as an encoding feature in several studies to predict RNA-binding residues [41,42].

Accessible Surface Area (ASA)
RNA-binding residues tend to be exposed so that they can interact with RNA, so calculating solvent accessibility is helpful in RNA-binding site prediction. The relative ASA can be calculated using NACCESS [43,44] when the protein structure is available. It is worth pointing out that the relative ASA is typically calculated on the protein alone, with the nucleic acid molecule absent from the structure. Residues with a relative ASA value greater than 5% are defined as surface accessible residues.

Chemical and Physical Features
Hydrophobicity
Hydrophobicity, which represents the degree to which a residue is repelled by water, is frequently used by RNA-binding site predictors. A hydrophobicity scale assigns a numerical value to each amino acid [45].
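To show how a hydrophobicity scale becomes a per-residue feature, the sketch below averages scale values over a sliding window. It assumes the widely used Kyte-Doolittle hydropathy scale purely for illustration; the scale cited in [45] may be a different one, and the function name `hydropathy_profile`, the window size, and the example sequence are hypothetical.

```python
# Kyte-Doolittle hydropathy values (assumed scale for this sketch).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=5):
    """Mean hydropathy in a window centred on each residue.

    Windows are truncated at the sequence ends. Residue codes outside the
    20 standard amino acids raise KeyError.
    """
    half = window // 2
    values = [KYTE_DOOLITTLE[aa] for aa in sequence.upper()]
    profile = []
    for i in range(len(values)):
        start = max(0, i - half)
        stop = min(len(values), i + half + 1)
        profile.append(sum(values[start:stop]) / (stop - start))
    return profile


# Invented example sequence: a basic stretch followed by a hydrophobic one.
for aa, h in zip("KRKIVL", hydropathy_profile("KRKIVL", window=3)):
    print(aa, round(h, 2))
```

The basic, strongly hydrophilic residues at the start of the example receive negative window averages, which is the kind of signal a predictor can combine with charge-related features.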

Electrostatic Patches
The state of a protein surface can be described by electrostatic patches. Generally, nucleic acid-binding interfaces are more likely to be positively charged electrostatic patches [46]. Electrostatic patches can be computed using GRASP [47], GRASS [48], or the web server PFplus (PatchFinderPlus; Available online: http://pfp.technion.ac.il) [49].

Cleft Size
Cleft size is an important feature because the largest cleft on a protein surface tends to be where the protein active site is located [50]. The charge, dipole, and quadrupole moments can also be used to adequately recognize RBPs [51].

Prediction Methods
The computational methods used in previous studies to identify RBPs or RNA-binding sites can be divided into three categories: (1) sequence-based prediction methods, used when the sequence is known but the structure is not; (2) structure-based prediction methods, used when the structure of the query protein has been resolved; and (3) modeling with a docking method, used when the structure of the complex is unknown. These three approaches are detailed below.

Sequence-Based Methods for RNA-Binding Site Prediction
Few protein-RNA complex structures are known, so prediction methods that use only sequence information play an important role. Jeong et al. [52] introduced a predictor for RNA-binding sites that uses predicted secondary structure and amino acid type as inputs to an artificial neural network. Subsequently, Terribilini et al. [33] contributed RNABindR, a classical method that trains naive Bayes (NB) classifiers to predict RNA-binding sites. The RB109 dataset is listed in Table 1. Wang and Brown developed the BindN tool, which is a predictor of RNA- and DNA-binding sites [9]. The sequence features used in this method include molecular mass, hydrophobicity index, and the side chain pKa value. Evolutionary information was later added to predictors, especially in the form of PSSMs. PPRint, developed by Kumar et al. [31], combined evolutionary information (PSSM) with support vector machines (SVMs) and significantly improved RNA-binding site and residue predictions. Wang et al. [53] used SVMs and PSSM profiles coupled with predicted SS and PSI-BLAST profiles in the PRINTR method to obtain improved performance. Tong et al. [54] introduced RISP, a hybrid RNA-binding site predictor that uses SVMs in conjunction with PSSMs and achieved a 61.0% increase in sensitivity and an 83.3% increase in specificity. A similar method, RNAProB, which uses an SVM and a novel smoothed PSSM encoding method, was developed by Cheng et al. [55] and performed better than the state-of-the-art systems of the time. In 2010, Li et al. [56] constructed a novel method that employs evolutionary PSSM and structure-derived features to predict RNA-binding residues, which led to significant improvement. Liu et al. [30] proposed a novel classification system that combined sequence/structure-based features with interaction propensity, a novel interacting feature, and used the random forest machine learning method. Furthermore, Liu et al. compared their method with previous methods (e.g., RNAProB, PPRint, BindN, and RNABindR) and achieved enhanced performance. Zhang et al. [57] presented an RNA-binding residue predictor using solvent accessibility, predicted SS, evolutionary conservation, and sequence information. RNABindRPlus [58] is a recently developed predictor that combines sequence homology with machine learning methods and markedly improved prediction reliability. Recently, Cheng et al. [27] developed a predictor (PRIPU) for protein-RNA interactions; its most important difference from earlier methods is that it uses only positive and unlabeled samples, not negative samples.

Sequence-Based Methods for RNA-Binding Proteins (RBPs) Prediction
Han et al. [36] explored the SVM machine learning method to predict RBPs directly from their primary sequence. The dataset in this work contained 447 RBPs and 4881 non-RBPs. The prediction accuracy was 40.0% and 99.9% for snRBPs and non-snRBPs, respectively, indicating the need for a sufficient number of proteins to train the SVMs. Shao et al. [59] developed an SVM-based predictor of RNA-binding proteins using sequence characteristics. As in RNA-binding site prediction, evolutionary information was introduced to improve the performance of RBP prediction. Kumar et al. [60] developed RNApred, which combines binding residues and PSSM profiles with the SVM method to discriminate RBPs from non-RBPs. A voting system has also been used to identify RBPs [42]. Zhao et al. developed SPOT for the prediction of RBPs using a fold recognition method, which is freely available on the internet for academic users (Table 2).

Structure-Based Methods for RNA-Binding Site Prediction
When the structure of the query protein is available and employed in the prediction system, the prediction becomes more reliable. There are a number of structure-based RNA-binding site prediction methods. Kim et al. [13] developed the KYG method, which uses sequence profiles, doublets of spatially close residues, a number of binding scores, and their combinations. Chen and Lim [61] developed a predictor based on structural information including electrostatics, evolution, and geometry, in which distinct clefts and surface patches were considered to be RNA-binding sites. Subsequently, PRIP [62] was created, which exploited structural and topological information (retention coefficient, betweenness-centrality, accessible surface area, and PSI-BLAST profile) and used two machine learning methods (SVM and naive Bayes classifiers). Towfic et al. [63] contributed Struct-NB, which uses structural features with a naive Bayes classifier to predict RNA-binding sites. Recently, two further structure-based predictors were proposed. RBRDetector [64], which uses evolutionary and microenvironmental features as inputs, combines feature- and template-based strategies to improve predictions of RNA-binding residues. The other predictor compares each template patch with surrounding patches and uses the accumulated distances as structural features [26].

Structure-Based Methods for RBP Prediction
Zhao et al. [15] introduced a predictor for RNA-binding domains based on structure information, which combined RNA binding affinity and relative structural similarity. SPOT-Seq-RNA [65] is a template-based structure prediction package which integrates RBP, RNA-binding residue, and protein-RNA complex structure prediction. RBPs and protein-RNA complexes are often modeled using the docking method.

Protein-RNA Complex Docking
Research on protein 3D structure modeling has become increasingly complex. Modeling the structure of a protein-RNA complex is very important for understanding the mechanisms of interaction. Several docking techniques used to predict protein-RNA complexes rely on known RNA and protein structures. No docking algorithms have been designed specifically for protein-RNA interactions; most reported docking techniques are adapted from protein-ligand and protein-protein docking software by employing energy or scoring functions fitted to protein-RNA interactions. For example, Katchalski-Katzir et al. [19] developed a low-resolution docking program, which requires specific scoring functions for different ligands. During modeling, the program performs a six-dimensional search over rigid-body rotations and translations of the ligand molecule and generates decoys. Gabb et al. [22] developed the FTDock program, which accepts nucleic acid molecules as well as proteins for docking. Ritchie and Kemp [20] introduced Hex, which enables protein-nucleic acid and protein-protein docking. Its decoy scoring method includes electrostatics and shape matching but has no special function for protein-RNA complexes. HADDOCK [18] can dock various molecules (e.g., nucleic acids, proteins, and other small molecules) and uses biochemical and biophysical characteristics as inputs. Recently, Tuszynska and Bujnicki [66] developed QUASI-RNP and DARS-RNP, which use statistical and quasi-chemical reference states to score protein-RNA decoys.

Prediction Algorithms
Almost all popular machine learning methods have been used for the prediction of RNA-binding sites or RBPs. Generally, machine learning methods achieve satisfactory performance when informative sequence- and/or structure-based features are used. The machine learning methods frequently used in RNA-binding research include SVMs [27,67], artificial neural networks (ANNs) [68], Bayesian networks [29,67], and random forests [12,69]. Puton et al. [8] built a meta-predictor of RNA-binding residues based on three of the highest ranked sequence-based primary predictors. This meta-predictor outperforms most other predictors. The template-based approach is another way to predict the structure of a protein-RNA complex when a template structure is available. This method recognizes putative RBPs by structurally aligning the query protein to RBPs with known structures. SPARKS X [15] is a program that predicts structure using templates. Similarly, TIM-align [70] is a structural alignment program.
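The idea behind a meta-predictor can be sketched as a weighted average of the per-residue scores returned by several primary predictors. This is only a generic illustration: the actual combination rule used by Puton et al. [8] is not described in this text, and the function name `meta_scores`, the weights, and the toy scores below are hypothetical. The sketch also assumes that all primary scores have already been brought to a common scale such as [0, 1].

```python
def meta_scores(per_predictor_scores, weights=None):
    """Weighted mean of per-residue scores from several predictors.

    per_predictor_scores: list of score lists, one list per predictor,
                          all of the same length (one score per residue)
    weights:              optional list of non-negative weights, one per
                          predictor; defaults to equal weights
    """
    if not per_predictor_scores:
        raise ValueError("at least one predictor is required")
    length = len(per_predictor_scores[0])
    if any(len(scores) != length for scores in per_predictor_scores):
        raise ValueError("all predictors must score the same residues")
    if weights is None:
        weights = [1.0] * len(per_predictor_scores)
    if len(weights) != len(per_predictor_scores):
        raise ValueError("one weight per predictor is required")
    total = sum(weights)
    return [
        sum(w * scores[i] for w, scores in zip(weights, per_predictor_scores)) / total
        for i in range(length)
    ]


# Invented scores from three hypothetical predictors for two residues.
combined = meta_scores([[0.2, 0.8], [0.4, 0.6], [0.6, 1.0]])
print([round(s, 3) for s in combined])
```

A residue would then be called RNA-binding when its combined score exceeds a threshold chosen on a validation set.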

Performance Measures
The parameters commonly used to assess the performance of RNA-binding site and RBP prediction include sensitivity, accuracy, strength, specificity, F-measure, precision, the Matthews correlation coefficient (MCC), and the area under the receiver operating characteristic curve (AUC). These parameters are listed in detail in Table 3.
In the formulas presented in Table 3, TP denotes true positives, which are correctly predicted RNA-binding residues; FP denotes false positives, which are non-binding residues mistakenly predicted as RNA-binding; TN denotes true negatives, which are correctly predicted non-RNA-binding residues; and FN denotes false negatives, which are RNA-binding residues wrongly predicted as non-binding. Because of the imbalance between positive and negative samples, the MCC is regarded as a proper measure of overall performance. An MCC of 0 corresponds to completely random prediction and an MCC of 1 to perfect prediction; negative values (down to -1) indicate predictions worse than random. A higher MCC represents better prediction performance. Another widely used measure is the receiver operating characteristic (ROC) curve, especially when several predictors are compared. The x-axis of the ROC curve represents the false positive rate and the y-axis denotes the true positive rate. The larger the area under the curve (AUC), the better the method.
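The measures above can be computed directly from the TP, FP, TN, and FN counts, as the following sketch shows. It uses the standard textbook definitions, which are assumed here and may differ in detail from the formulas in Table 3 (for example, some RNA-binding studies define specificity differently). The AUC is computed through its rank interpretation, that is, the fraction of positive-negative pairs in which the positive residue receives the higher score. The function names and the example counts are hypothetical.

```python
import math


def safe_div(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is 0."""
    return numerator / denominator if denominator else 0.0


def confusion_metrics(tp, fp, tn, fn):
    """Standard performance measures from confusion-matrix counts."""
    sensitivity = safe_div(tp, tp + fn)   # true positive rate (recall)
    specificity = safe_div(tn, tn + fp)   # true negative rate
    precision = safe_div(tp, tp + fp)
    accuracy = safe_div(tp + tn, tp + fp + tn + fn)
    f_measure = safe_div(2 * precision * sensitivity, precision + sensitivity)
    mcc = safe_div(
        tp * tn - fp * fn,
        math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    )
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "f_measure": f_measure,
        "mcc": mcc,
    }


def auc(scores, labels):
    """AUC as the probability that a positive outranks a negative.

    labels: 1 for RNA-binding, 0 for non-binding; ties count as 0.5.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return safe_div(wins, len(positives) * len(negatives))


# Invented counts and scores, for illustration only.
for name, value in confusion_metrics(tp=40, fp=10, tn=130, fn=20).items():
    print(name, round(value, 3))
print("auc", auc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0]))
```

With imbalanced data such as binding-site labels, accuracy can look high even for a weak predictor, whereas the MCC uses all four counts and is therefore harder to inflate.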

Comparison of Various Prediction Methods
The prediction results of existing methods for RNA-binding site and RBP prediction are summarized in Table 4. The accuracy of most predictors is approximately 60%-80%, and their specificity and sensitivity range widely. Each method has its own strengths because of the different datasets, input features, and algorithms used. Three main datasets are listed in Table 4: RB75, RB172, and RB344. Several original studies [8,28,71] independently compared multiple predictors on a unified dataset, and their results are summarized in this manuscript. The MCC is generally considered an unbiased measure and has been reported for most methods, which helps significantly when comparing their performance. Subsequently, a meta-predictor that combines three predictors was developed and shows satisfactory performance [8].

Collection of Web Servers of RBPs and RNA-Binding Site Predictors
Many researchers provide web servers when they develop novel methods to predict RNA-binding sites and RBPs. Several protein-RNA complex docking programs are also available. We collected their URLs, which are divided into sequence-based predictors, structure-based predictors, and docking methods (Table 2). We tested every web server and marked in the table whether it was available, and noted whether each approach is aimed at predicting binding sites or RBPs. Web servers provide easy-to-use tools to the community: users can learn about the algorithm and conveniently obtain prediction results, while developers can continually refine their methods using users' feedback.

Conclusions and Future Perspectives
Due to the significant biological roles of several RNA types, RNA-binding site prediction has become more and more important in the area of protein functional site prediction. Prediction accuracy has improved significantly during the past decades and a number of web servers are available to experimental scientists. Nevertheless, current predictors still have shortcomings, and further research is required to improve their effectiveness.
Three outstanding issues face efforts to predict RNA-binding sites and RBPs. The first is how to distinguish DNA-binding sites from RNA-binding sites. Generally, prediction approaches that use templates are more effective than those using machine learning methods for distinguishing RBPs from DNA-binding proteins. Conversely, for RBPs that cannot be detected by template-based methods, several machine learning methods can still detect RNA-binding residues. Therefore, combining the strengths of the two approaches has the potential to improve RNA-binding site and RBP prediction. The second issue is that it remains unclear which features contribute more and which contribute less to a mature machine learning predictor. The selection of novel and effective features is likely to be one of the most important steps in RBP and RNA-binding site prediction. The third issue is that existing protein-RNA docking approaches do not take into account the conformational changes that may occur when the protein and RNA molecules bind. It would be useful to model the 3D RNA structure using RNA folding simulations [72][73][74] and to adapt those methods to refold RNA fragments, so as to simulate the protein-RNA interaction and minimize its energy [75][76][77][78][79]. Rother et al. [80] successfully combined RNA and protein 3D structures into a unified modeling method. Moreover, further comparison studies are required to adequately evaluate the advantages and disadvantages of the various methods.
Table 3 (fragment). AUC: probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one; $\mathrm{AUC} = \sum_{i=1}^{n} T_i / (nT)$.