A Max-Margin Model for Predicting Residue—Base Contacts in Protein–RNA Interactions

Protein–RNA interactions (PRIs) are essential for many biological processes, so understanding aspects of the sequences and structures involved in PRIs is important for unraveling such processes. Because of the expensive and time-consuming techniques required for experimental determination of complex protein–RNA structures, various computational methods have been developed to predict PRIs. However, most of these methods focus on predicting only RNA-binding regions in proteins or only protein-binding motifs in RNA. Methods for predicting entire residue–base contacts in PRIs have not yet achieved sufficient accuracy. Furthermore, some of these methods require the identification of 3D structures or homologous sequences, which are not available for all protein and RNA sequences. Here, we propose a prediction method for predicting residue–base contacts between proteins and RNAs using only sequence information and structural information predicted from sequences. The method can be applied to any protein–RNA pair, even when rich information such as its 3D structure, is not available. In this method, residue–base contact prediction is formalized as an integer programming problem. We predict a residue–base contact map that maximizes a scoring function based on sequence-based features such as k-mers of sequences and the predicted secondary structure. The scoring function is trained using a max-margin framework from known PRIs with 3D structures. To verify our method, we conducted several computational experiments. The results suggest that our method, which is based on only sequence information, is comparable with RNA-binding residue prediction methods based on known binding data.


Introduction
Recent studies have begun unraveling the mechanisms of biological processes involving functional non-coding RNAs, most of which interact with RNA-binding proteins (RBPs) in essential roles, such as splicing, transport, localization, and translation. These interactions involve sequence-and structure-specific recognition between proteins and RNAs. Therefore, understanding aspects of sequences and structures involved in protein-RNA interactions (PRIs) is important for understanding many biological processes. To that end, several studies have focused on the analysis and discussion of PRIs [1][2][3].
Compared with deciphering genomic sequences by using high-throughput sequencing technology, experimental determination of protein-RNA joint structures is both more expensive and more time consuming. Accordingly, rapid computational prediction of PRIs from only sequence information is desirable. Existing methods for computational prediction of PRIs can be roughly classified into four groups. The first group predicts whether a given protein-RNA pair interacts or not [4][5][6][7]. A prediction algorithm for this approach can be simply designed from interacting protein-RNA pairs alone, so 3D structures and residuebase contacts are not necessary for use in model training. However, this approach cannot predict binding sites of proteins and RNAs that should be biologically and structurally essential for PRIs. The second group aims to predict RNA-binding residues from protein information. DR_bind1 [8], KYG [9], and OPRA [10] are structure-based methods that use Life 2021, 11, 1135 2 of 13 3D structures from PDB to extract descriptors for prediction. BindN+ [11] and Pprint [12] are sequence-based methods that employ evolutionary information instead of 3D structures. However, this approach ignores the binding partners of target proteins, although some RNA-binding domains in RBPs recognize sequence-and structure-specific motifs in RNA sequences. The third group computes RNA structural motifs recognized by RNA-binding domains in certain proteins and includes MEMERIS [13], RNAcontext [14], CapR [15], and GraphProt [16]. This approach focuses on a certain RBP and extracts RNA motifs as consensus sequences and/or secondary structures of the RBP-binding RNAs. The fourth and final group of methods predicts intermolecular joint structures between proteins and RNAs such as residue-base contacts. To our knowledge, Hayashida et al. [17] have developed the only method of this type. However, it is unfortunately not sufficiently accurate.
Accordingly, we propose a prediction method for residue-base contacts between proteins and RNAs based only on sequence information and structural information predicted from sequences. Our method can be applied to any protein-RNA pair, including those for which rich information, such as 3D structures, are unavailable. Residue-base contact prediction is formalized as an integer programming (IP) problem. Our method predicts a residue-base contact map that maximizes a scoring function based on sequence features such as k-mers of sequences and predicted secondary structures. The scoring function is trained by a max-margin framework from known PRIs with 3D structures. To verify our method, we performed several computational experiments. The results suggest that our method based on only sequence information is comparable with RNA-binding residue prediction methods based on actual known binding data.

Methods
We present a novel algorithm for predicting PRIs using IP. Our algorithm consists of the following two parts: (1) prediction of a residue-base contact map given a protein and RNA pair by solving an integer programming problem; and (2) learning a scoring function from a given training dataset using a max-margin framework.

Preliminaries
Let Σ p represent the set of 20 canonical amino acid residues and let Σ * p denote the set of all finite amino acid sequences consisting of residues in Σ p . Similarly, let Σ r represent the set of the four canonical ribonucleotide bases (A, C, G, and U) and let Σ * r denote the set of all finite RNA sequences consisting of bases in Σ r . Given a protein P = p 1 · · · p |P| ∈ Σ * p consisting of |P| residues and an RNA R = r 1 · · · r |R| ∈ Σ * r consisting of |R| bases, let CM(P, R) represent the space of all possible residue-base contact maps between P and R. An element z ∈ CM(P, R) is represented as an |P| × |R| binary-valued matrix, where z ij = 1 indicates that residue p i interacts with the base r j (Figure 1). We define the problem of PRI prediction as follows: given a protein P and an RNA R, predict a residue-base contact map z ∈ CM(P, R).

Scoring Model
A scoring model f is a function that assigns real-valued scores to protein-RNA pairs (P, R) and residue-base contact maps z ∈ CM(P, R). Our aim is to find a residue-base contact map z ∈ CM(P, R) that maximizes the scoring function f (P, R, z) for a given protein-RNA pair (P, R). The scoring function f (P, R, z) is computed on the basis of various local features of P, R, and z. These features correspond to residue features, base features, and residue-base contact features that describe local contexts around residue-base contacts, respectively.
Residue features, as summarized in Table 1, describe the binding preference in the amino acid sequences by local contexts around residue-base contacts. For this purpose, we employ k-mers of the amino acids centered on the interacting ith residue. For each k-mer of the amino acids, p kmer ∈ Σ k p , we define a binary-valued local feature of the ith residue as φ p kmer (P, z, i) = I(kmer(P, i) = p kmer )I(x i = 1), where I(condition) is an indicator function that takes a value of 1 or 0 depending on whether the condition is true or false, kmer(P, i) is the k-mer of the substring of P centered on the ith residue p i , that is, kmer(P, i) = p i−(k−1)/2 . . . p i . . . p i+(k−1)/2 , and x i is a binaryvalued variable such that x i = 1 if and only if residue p i is a binding site (Figure 1), that is, ∑ |R| j=1 z ij ≥ 1. We use k = 3 and k = 5 to characterize k-mer features. To reduce the sparsity of amino acid contexts, we consider the k-mers of simplified alphabets of amino acids proposed by Murphy et al. [18], who calculated groups of simplified alphabets based on the BLOSUM50 matrix [19]. Note that Murphy et al. [18] have shown that the simplified alphabets are correlated with physiochemical properties such as hydrophobicity, hydrophilicity, and polarity, which may have important roles in PRIs. We employ the simplified alphabets of 10 groups, Σ g10 , and those of 4 groups, Σ g4 (Table 2). For each string sa kmer ∈ Σ k g10 (or Σ k g4 ), we define a binary-valued local feature of the ith residue as φ sa kmer (P, z, i) = I(kmer(P sa , i) = sa kmer )I( where P sa is the string of simplified alphabets Σ g10 (or Σ g4 ) converted from P according to Table 2. In contrast with the k-mers used in other part of this algorithm, we instead use k = 5 and k = 7 for the k-mers of simplified alphabets.
To consider the structural preference of RNA-binding residues, we employ secondary structures predicted by SSpro8 [20]. We predict one structural element [α-helix (H), 3-helix (G), 5-helix (I), folded (E), β-turn (B), corner (T), curl (S), and loop (-)] for each residue. For each string sp kmer of structural elements of length k, we define a binary-valued local feature of the ith residue as where P sp is the string of structural elements predicted from P. Here, we again use structural contexts with lengths k = 3 and k = 5.
The collection of occurrences of the residue features are calculated as where φ p (P, z, i) is a vector whose elements are the residue features of the ith residue mentioned above. Base features, as summarized in Table 3, describe the binding preference in the ribonucleotide sequences by local contexts around residue-base contacts. In addition to the residue features, we employ the k-mer contexts of the ribonucleotides centered on the interacting jth base. For each k-mer of the ribonucleotides r kmer ∈ Σ k r , we define a binary-valued local feature of the jth base as where y j is a binary-valued variable such that y j = 1 if and only if the residue r j is a binding site (Figure 1 Here, we once again use k = 3 and 5 for the k-mer features. To consider the structural preference of binding sites, we employ secondary structures predicted by CENTROIDFOLD [21]. We assign a structural element [external loop (E), hairpin loop (H), internal loop (I), bulge (B), multibranch loop (M), or stack (S), as shown in Figure 2] to each base. Note that to encode secondary structures as a sequence, this encoding of structural profiles loses a portion of the structural information, e.g., basepairing partners for stacking bases. However, this approach is still efficient for describing structural information [13][14][15]. For each k-length string sr kmer of structural elements, we define a binary-valued local feature of the jth base as φ sr kmer (R, z, j) = I(kmer(R sr , j) = sr kmer )I(y j = 1), where R sr is the string of structural elements predicted from R. Here, we use structural contexts with lengths k = 3 and k = 5.   The collection of occurrences of the base features are calculated as where φ r (R, z, j) is a vector whose elements are the base features of the jth base mentioned above.
Residue-base contact features, which are summarized in Table 4, describe the binding affinity between the local contexts of amino acids and ribonucleotides. For this purpose, we employ combinations of the residue features and the base features mentioned above. For example, for each pair of k-mers of amino acids p kmer and ribonucleotides r kmer , we define a binary-valued local feature of the ith residue and the jth base: φ p kmer ,r kmer (P, R, z, i, j) = I(kmer(P, i) = p kmer )I(kmer(R, j) = r kmer )I(z ij = 1). The collection of occurrences of the residue-base contact features are calculated as where φ c (P, R, z, i, j) is a vector whose elements are the residue-base contact features of the ith residue and the jth base mentioned above. The notation Φ(P, R, z) denotes the feature representation of protein-RNA pair (P, R) and its residue-base contact map z ∈ CM(P, R), that is, the collection of occurrences of local features in P, R, and z defined as follows: Each feature in Φ is associated with a corresponding parameter, and the score for the feature is defined as the value of the occurrence multiplied by the corresponding parameter. We define the scoring model f (P, R, z) as a linear function where ·, · is the inner product and λ = (λ p , λ r , λ c ) is the corresponding parameter vector trained with training data as described in Section 2.4.

IP Formulation
To formulate the problem as an IP problem, we rewrite the scoring function (5) as where u i , v i , and w ij represent the binding preferences for x i , y j , and z ij , respectively, calculated as We find a z ∈ CM(P, R) that maximizes the objective function (6) under the following constraints to ensure consistency among the variables x i , y j , and z ij as follows: The constraints defined by Equations (7)-(9) describe the relation between contacts z ij and binding sites x i , y j . The constraint defined by Equation (10) disallows any isolated interacting bases, which are rare in PRIs. The constraints defined by Equations (11) and (12) define the upper bound on the number of contacts X i and Y j for each residue and base, respectively.

Learning Algorithm
To optimize feature parameter λ, we employ a max-margin framework called structured support vector machines [22]. Given a training dataset D = {(P (k) , R (k) , z (k) )} K k=1 , where P (k) and R (k) are the protein and RNA sequences, respectively, and z (k) ∈ CM(P (k) , R (k) ) is their corresponding contact map for the kth datapoint, we aim to find the parameter λ that minimizes the objective function where ||.|| 1 is the 1 norm and C is a weight for the 1 regularization term to avoid overfitting to the training data. Here, ∆(z,ẑ) is a loss function ofẑ for z defined as ∆(z,ẑ) =δ FN residue (# of false negative residues) (14) + δ FP residue (# of false positive residues) + δ FN base (# of false negative bases) + δ FP base (# of false positive bases) where δ FN residue , δ FP residue , δ FN base , δ FP base , δ FN contact , and δ FP contact are hyperparameters controlling the trade-off between sensitivity and specificity for learning the parameters. In this case, we can calculate the first term of Equation (13) by replacing scores u i , v j , and w ij in Equation (6) as follows:ū See Section S1 in the Supplementary Material for the derivation.

Implementation
Our method was implemented using the IBM CPLEX optimizer http://www.ibm. com/software/integration/optimization/cplex-optimizer/) (accessed on 21 October 2021) for solving IP problems (6)- (12). To extract the structural feature elements described in Section 2.2, we employed SSpro8 [20] and CENTROIDFOLD [21] to predict secondary structures of protein and RNA sequences, respectively. We empirically chose the following hyperparameters: penalty for positives, δ FN * = 0.5; penalty for negatives, δ FP * = 0.005; and the weight for the 1 regularization term, C = 10 −5 . See Section S2 in the Supplementary Material for details. We implemented AdaGrad [24] to control the learning rate η in Algorithm 1. The source code for our algorithm is available at https://github.com/keiobioinformatics/practip/ (accessed on 21 October 2021).

Dataset
We prepared our datasets in accordance with those of Chen et al. [8] and Miao et al. [25] and extracted RNA-bound proteins with an X-ray resolution of ≤3.0 Å from the Protein Data Bank (PDB) [26]. To reduce dataset redundancy, we discarded some extracted data such that the dataset contained no protein pairs whose sequence identity was >30%. As a result, our test dataset consisted of 98 protein-RNA interacting pairs from 81 protein-RNA complexes from Chen et al. [8] as listed in Table S6 in the Supplementary Matrial, and our training dataset consisted of 4399 protein-RNA interacting pairs from 772 protein-RNA complexes was from Miao et al. [25]. Note that our training data and test data share no common complexes. We considered a residue to bind RNA if at least one non-hydrogen atom was contained within the van der Waals contact (4.0Å) or hydrogen-bonding distance (3.5Å) from the non-hydrogen atom of its binding partner. We employed HBPLUS [27] to detect the hydrogen bonds and van der Waals contacts. Our datasets are available at https://doi.org/10.5281/zenodo.5584470 (accessed on 21 October 2021).

Prediction of Residue-Base Contacts
To validate our method, we conducted computational experiments on our dataset, comparing the accuracy under several conditions related to the maximum number of contacts for each residue and base, X i and Y j in Equations (11) and (12) from 1 to 9, and no upper bounds.
We evaluated the accuracy of predicting residue-base contacts between proteins and RNAs using three measures: predicted residue-base contacts, binding residues in proteins, and binding bases in RNA sequences. The accuracy of residue-base contacts is assessed by the positive predictive value (PPV) and the sensitivity (SEN), respectively defined as where TP is the number of correctly predicted contacts (true positives), FP is the number of incorrectly predicted contacts (false positives), and FN is the number of contacts in the true contact map that were not predicted (false negatives). We also used the F-value as a balanced measure between PPV and SEN, and it is defined as their harmonic mean: The accuracy of binding residues and binding bases is defined in the same way. Table 5 shows the accuracy of predicting residue-base contacts in PRIs, binding residues in proteins, and binding bases in RNA sequences for upper bounds of contacts X i , Y j in Equations (11) and (12) from 1 to 9 and for no upper bounds. The case with the strongest constraint (X i = Y j = 1) has a very high PPV because it limits the number of contacts to be predicted, while its SEN is poor because of a lack of coverage of the prediction. On the other hand, if there is no constraint on the number of contacts (corresponding to the row labeled "no limit" in Table 5), both PPV and SEN are not high owing to many incorrect predictions being made. We found that if the upper limit of the number of contacts is set between 4 and 9, reasonably accurate contact prediction, residue binding site prediction, and base binding site prediction can be obtained. As a result, we set X i = Y j = 8 as the default constraint for the upper bound of the number of contacts.  It should be noted that in this experiment, we were unable to compare our method with the method by Hayashida et al. [17], which is the only published method for predicting residue-base contacts in PRIs. Specifically, we were unable to conduct an experiment using the method by Hayashida et al. on the same dataset because their software implementation is not yet available and their method requires homologous sequences with accurate alignments to calculate evolutionary information. In addition, Hayashida et al. [17] have reported that the method is not sufficiently accurate for such analyses.

Comparison of Binding Residues Predictions among the Present and Existing Methods
We compared our method with existing methods for predicting RNA-binding residues in proteins. DR_bind1 [8], KYG [9], and OPRA [10] are structure-based methods that use 3D structures from PDB to extract descriptors for prediction. BindN+ [11] and Pprint [12] are sequence-based methods that employ evolutionary information instead of 3D structures. Table 6 indicates that our method is comparable to other methods. Recall that our method employs only sequence information and structural information predicted from sequences as well as information on the partner RNAs bound to RNA-binding proteins, rather than 3D structures and evolutionary information.

Discussion
Several existing methods for predicting PRIs utilize evolutionary information from homologous sequences, [11,12] for protein sequences and [17] for both protein and RNA sequences. Homologous sequences of target sequences are typically searched for in large databases using a highly sensitive homology search engine such as PSI-BLAST [28]. Furthermore, to extract evolutionary information, homologous sequences must be aligned before PRI prediction. Homology searches are employed in a wide range of analyses, such as functional analysis of proteins, because if homologous proteins can be found in curated databases, the function of the target protein can be easily inferred. However, as described above and by Zhang et al. [29], the secondary structures of proteins play essential roles in residue-base contacts. Similarly, structural elements of RNA secondary structures also serve as key descriptors for residue-base contact prediction [13][14][15][16]. This means that structure-based homology searches are needed for PRI prediction based on evolutionary information. Although efficient structural alignment algorithms for proteins (e.g., [30]) and RNAs (e.g., [31]) have recently been developed, they have not yet been successfully applied to large-scale homology searches.
To our knowledge, Hayashida et al. [17] have developed the only existing method that predicts intermolecular joint structures between proteins and RNAs such as residue-base contacts; however, this method is unfortunately not sufficiently accurate. The method by Hayashida et al. [17] is similar to our method in that its approach is based on a machine learning technique with 1 regularization. The main difference between our method and the method by Hayashida et al. [17] is that our method employs a large number of features, including structural information about proteins and RNAs, which have been shown to serve as key descriptors of PRIs as mentioned above.
We utilized the structural profiles of predicted RNA secondary structures, which does lose an important part of structural information, such as base-pairing partners for stacking bases. Most of the existing RBP-binding RNA motif finding methods [13][14][15] have also utilized similar encoding, which may not be suitable for dealing with the recognition sites of double-stranded RNA-binding proteins. GraphProt [16] is an exceptional algorithm that utilizes graph-based encoding of RNA secondary structures. Our method should be extended by utilizing another structural profile with no loss of base pairing information like the graph-based encoding of GraphProt.
To predict the secondary structure of RNA and amino acid sequences, we employed CENTROIDFOLD [21] and SSPro8 [20], which are standard tools, respectively. Since our method takes as input the results of secondary structure prediction, the prediction error may propagate to the residue-base contact prediction and worsen the prediction accuracy. The accuracy of our method could be improved by exploring various combinations of prediction methods, including the state-of-the-art secondary structure prediction methods such as MXfold2 [32] and DeepCNF [33].
As shown in Section 2.3, we formulated the residue-base contact prediction as an IP problem, which enables us to build a flexible model, including, for example, constraints on the upper bound on the number of contacts for each residue and base. In contrast to the RNA-RNA interaction model [34,35] in which each base interacts with at most one base via hydrogen bonds such as Watson-Crick and wobble base pairs, PRIs contain diverse patterns of residue-base contacts. For example, Kondo et al. have classified residue-base contacts with respect to three interaction edges on nucleotides (Watson-Crick, Hoogsteen, and sugar) with side-chains and backbones of their partner residues, and have analyzed their propensities [1]. Thus, there is room for further improvement of our model, which can be extended by using other constraints for each contact between a residue and a base to include such considerations.
In terms of the formulation as the integer programming problem, the RNA-RNA interaction prediction model [34,35] and our model for protein-RNA interaction prediction proposed in this paper are quite similar. In the RNA-RNA interaction prediction model, the probability distribution of RNA-RNA interactions can be calculated (even though it is an approximation), and thus the number of variables to be handled in the integer programming problem can be greatly reduced by using a technique called the threshold cut, which has succeeded in reducing the computation time. However, since such probability distributions are not known so far for protein-RNA interactions, there is no breakthrough technique that can significantly speed up the process like threshold cut. Therefore, speeding up our method is one of the future challenges for large-scale screening of protein-RNA interactions.
The large-scale sequencing data produced by RNA-related high-throughput sequencing technologies, such as Structure-seq [36] and hiCLIP [37], will help us improve our algorithm, especially by providing data for training the model. In the present work, we employed complete joint 3D structures of proteins and RNAs as the training dataset, which was not sufficiently large. We cannot build from large-scale sequencing data a complete dataset with residue-base contact maps, but we can partially calculate structural profiles and binding bases from in vivo chemical probing data such as Structure-seq datasets. This information will significantly help us improve our model.
Deep learning has been increasingly used in various fields, including bioinformatics, in recent years. Wei et al. [38] have provided a review of the use of deep learning in RNAprotein interaction prediction. Yamada et al. [39] have developed a method to accurately identify RNA sequences that interact with a particular protein by using the DNABERT model [40] that is pre-trained using the human genome. Although our method does not use deep learning, we expect to achieve higher accuracy in prediction by using a pretrained BERT model, which could be improved through the application of deep learning relatively easily.

Conclusions
We developed a max-margin framework for predicting residue-base contacts between proteins and RNAs based on integer programming. To verify our method, we performed several computational experiments. The results suggest that our method based only on sequence information and structural information predicted from sequences is comparable with RNA-binding residue prediction methods based on known binding data. Further improvements are needed, such as the incorporation of informative features, the development of a joint prediction model that simultaneously predicts RNA secondary structures and protein contact maps, and the utilization of high-throughput sequencing data that can deal with PRI without residue-base contact information as training data.