We present a novel algorithm for predicting PRIs using IP. Our algorithm consists of the following two parts: (1) prediction of a residue–base contact map given a protein and RNA pair by solving an integer programming problem; and (2) learning a scoring function from a given training dataset using a max-margin framework.
2.2. Scoring Model
A scoring model $f$ is a function that assigns real-valued scores to protein–RNA pairs $(P,R)$ and residue–base contact maps $z\in \mathcal{CM}(P,R)$. Our aim is to find a residue–base contact map $z\in \mathcal{CM}(P,R)$ that maximizes the scoring function $f(P,R,z)$ for a given protein–RNA pair $(P,R)$. The scoring function $f(P,R,z)$ is computed on the basis of various local features of $P$, $R$, and $z$. These features fall into three classes: residue features, base features, and residue–base contact features, each of which describes the local context around residue–base contacts.
Residue features, as summarized in Table 1, describe the binding preference in the amino acid sequences by local contexts around residue–base contacts. For this purpose, we employ $k$-mers of the amino acids centered on the interacting $i$th residue. For each $k$-mer of the amino acids, ${p}_{kmer}\in {\mathrm{\Sigma}}_{p}^{k}$, we define a binary-valued local feature of the $i$th residue as
$${\varphi}_{{p}_{kmer}}(P,z,i)=I\left(kmer(P,i)={p}_{kmer}\right)\,{x}_{i},$$
where $I\left(condition\right)$ is an indicator function that takes a value of 1 or 0 depending on whether the $condition$ is true or false, $kmer(P,i)$ is the $k$-mer of the substring of $P$ centered on the $i$th residue ${p}_{i}$, that is, $kmer(P,i)={p}_{i-(k-1)/2}\dots {p}_{i}\dots {p}_{i+(k-1)/2}$, and ${x}_{i}$ is a binary-valued variable such that ${x}_{i}=1$ if and only if residue ${p}_{i}$ is a binding site (Figure 1), that is, ${\sum}_{j=1}^{\left|R\right|}{z}_{ij}\ge 1$. We use $k=3$ and $k=5$ for the $k$-mer features.
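The k-mer residue feature described above can be illustrated with a minimal Python sketch. The function names (`kmer`, `binding_residues`, `residue_kmer_feature`) and the representation of the contact map as a list of 0/1 rows are assumptions made for illustration; the text does not say how residues near the sequence ends are handled, so returning `None` there is also an assumption.

```python
from typing import List, Optional


def kmer(seq: str, i: int, k: int) -> Optional[str]:
    """k-mer of seq centered on the 1-based position i (k is odd).

    Returns None when the window runs past either end of the sequence
    (boundary handling is an assumption, not taken from the text).
    """
    half = (k - 1) // 2
    start, end = i - 1 - half, i + half  # 0-based slice bounds
    if start < 0 or end > len(seq):
        return None
    return seq[start:end]


def binding_residues(z: List[List[int]]) -> List[int]:
    """x_i = 1 if and only if residue i contacts at least one base."""
    return [1 if sum(row) >= 1 else 0 for row in z]


def residue_kmer_feature(P: str, z: List[List[int]], i: int, p_kmer: str) -> int:
    """Indicator that the k-mer centered on residue i equals p_kmer, times x_i."""
    x = binding_residues(z)
    return int(kmer(P, i, len(p_kmer)) == p_kmer) * x[i - 1]


# Toy example: 5 residues, 3 bases, residue 3 (R) contacts base 1.
P = "MKRAG"
z = [[0, 0, 0], [0, 0, 0], [1, 0, 0], [0, 0, 0], [0, 0, 0]]
print(residue_kmer_feature(P, z, 3, "KRA"))  # feature fires: 1
print(residue_kmer_feature(P, z, 2, "MKR"))  # k-mer matches but x_2 = 0: 0
```

The feature is nonzero only when both conditions hold: the centered k-mer matches and the residue is a binding site in the contact map under consideration.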
To reduce the sparsity of amino acid contexts, we consider the $k$-mers of simplified alphabets of amino acids proposed by Murphy et al. [18], who calculated groups of simplified alphabets based on the BLOSUM50 matrix [19]. Note that Murphy et al. [18] have shown that the simplified alphabets are correlated with physicochemical properties such as hydrophobicity, hydrophilicity, and polarity, which may have important roles in PRIs. We employ the simplified alphabets of 10 groups, ${\mathrm{\Sigma}}_{g10}$, and those of 4 groups, ${\mathrm{\Sigma}}_{g4}$ (Table 2).
For each string $s{a}_{kmer}\in {\mathrm{\Sigma}}_{g10}^{k}$ (or ${\mathrm{\Sigma}}_{g4}^{k}$), we define a binary-valued local feature of the $i$th residue as
$${\varphi}_{s{a}_{kmer}}(P,z,i)=I\left(kmer({P}_{sa},i)=s{a}_{kmer}\right)\,{x}_{i},$$
where ${P}_{sa}$ is the string over the simplified alphabet ${\mathrm{\Sigma}}_{g10}$ (or ${\mathrm{\Sigma}}_{g4}$) converted from $P$ according to Table 2. In contrast with the $k$-mers used in the other parts of this algorithm, we use $k=5$ and $k=7$ for the $k$-mers of simplified alphabets.
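As a rough illustration of the conversion from $P$ to ${P}_{sa}$, the sketch below maps each amino acid to a representative letter of its group. The groupings shown are the 10-group and 4-group alphabets commonly attributed to Murphy et al. and are an assumption here; Table 2 remains the authoritative definition. The representative letters, the `simplify` helper, and the use of `X` for unrecognized residues are likewise illustrative choices.

```python
from typing import Dict

# Assumed groupings (check against Table 2); each key lists the members of
# one group and the value is an arbitrary representative letter.
MURPHY_10: Dict[str, str] = {
    "LVIM": "L", "C": "C", "A": "A", "G": "G", "ST": "S",
    "P": "P", "FYW": "F", "EDNQ": "E", "KR": "K", "H": "H",
}
MURPHY_4: Dict[str, str] = {
    "LVIMC": "L", "AGSTP": "A", "FYW": "F", "EDNQKRH": "E",
}


def simplify(P: str, groups: Dict[str, str]) -> str:
    """Convert an amino acid string into a simplified-alphabet string."""
    lookup = {aa: rep for members, rep in groups.items() for aa in members}
    # Unknown symbols are mapped to 'X' (an assumption for robustness).
    return "".join(lookup.get(aa, "X") for aa in P)


print(simplify("MKRAG", MURPHY_10))  # LKKAG
print(simplify("MKRAG", MURPHY_4))   # LEEAA
```

Once the sequence is converted, the same centered k-mer lookup used for plain amino acid contexts applies to the simplified string.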
To consider the structural preference of RNA-binding residues, we employ secondary structures predicted by SSpro8 [20]. We predict one structural element [$\alpha $-helix (H), 3-helix (G), 5-helix (I), extended strand (E), $\beta $-bridge (B), turn (T), bend (S), and loop (–)] for each residue. For each string $s{p}_{kmer}$ of structural elements of length $k$, we define a binary-valued local feature of the $i$th residue as
$${\varphi}_{s{p}_{kmer}}(P,z,i)=I\left(kmer({P}_{sp},i)=s{p}_{kmer}\right)\,{x}_{i},$$
where ${P}_{sp}$ is the string of structural elements predicted from $P$. Here, we again use structural contexts with lengths $k=3$ and $k=5$.
The collection of occurrences of the residue features is calculated as
$${\mathrm{\Phi}}_{p}(P,z)=\sum_{i=1}^{\left|P\right|}{\mathbf{\varphi}}_{p}(P,z,i),$$
where ${\mathbf{\varphi}}_{p}(P,z,i)$ is a vector whose elements are the residue features of the $i$th residue mentioned above.
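Summing binary features over all residues amounts to counting how often each context occurs at a binding site. A minimal sketch follows, restricted to plain amino acid k-mers and storing the result sparsely in a `Counter`; the sparse representation, the `(k, kmer)` key format, and the skipping of windows that overrun the sequence ends are assumptions made for illustration.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def residue_feature_counts(
    P: str, x: List[int], ks: Iterable[int] = (3, 5)
) -> Counter:
    """Count k-mer contexts centered on binding residues (x[i] == 1).

    Indices are 0-based here. Each key is (k, kmer); each value is the
    number of binding residues whose centered k-mer equals that string.
    """
    counts: Counter = Counter()
    for i, x_i in enumerate(x):
        if x_i != 1:
            continue  # non-binding residues contribute nothing
        for k in ks:
            half = (k - 1) // 2
            if i - half < 0 or i + half >= len(P):
                continue  # window overruns the sequence (assumed skipped)
            counts[(k, P[i - half : i + half + 1])] += 1
    return counts


# Toy example: residues 2 and 5 (1-based) are binding sites.
P = "AKRAKRA"
x = [0, 1, 0, 0, 1, 0, 0]
phi_p = residue_feature_counts(P, x)
print(phi_p[(3, "AKR")])    # both binding residues share this 3-mer: 2
print(phi_p[(5, "RAKRA")])  # only residue 5 has a full 5-mer window: 1
```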
Base features, as summarized in Table 3, describe the binding preference in the ribonucleotide sequences by local contexts around residue–base contacts. In addition to the residue features, we employ the $k$-mer contexts of the ribonucleotides centered on the interacting $j$th base. For each $k$-mer of the ribonucleotides ${r}_{kmer}\in {\mathrm{\Sigma}}_{r}^{k}$, we define a binary-valued local feature of the $j$th base as
$${\varphi}_{{r}_{kmer}}(R,z,j)=I\left(kmer(R,j)={r}_{kmer}\right)\,{y}_{j},$$
where $kmer(R,j)$ is the $k$-mer of the substring of $R$ centered on the $j$th base ${r}_{j}$, and ${y}_{j}$ is a binary-valued variable such that ${y}_{j}=1$ if and only if the base ${r}_{j}$ is a binding site (Figure 1), that is, ${\sum}_{i=1}^{\left|P\right|}{z}_{ij}\ge 1$. Here, we again use $k=3$ and $k=5$ for the $k$-mer features.
To consider the structural preference of binding sites, we employ secondary structures predicted by CentroidFold [21]. We assign a structural element [external loop (E), hairpin loop (H), internal loop (I), bulge (B), multibranch loop (M), or stack (S), as shown in Figure 2] to each base. Note that encoding a secondary structure as a sequence of structural elements loses a portion of the structural information, e.g., the base-pairing partners of stacking bases. However, this approach is still efficient for describing structural information [13,14,15]. For each $k$-length string $s{r}_{kmer}$ of structural elements, we define a binary-valued local feature of the $j$th base as
$${\varphi}_{s{r}_{kmer}}(R,z,j)=I\left(kmer({R}_{sr},j)=s{r}_{kmer}\right)\,{y}_{j},$$
where ${R}_{sr}$ is the string of structural elements predicted from $R$. Here, we use structural contexts with lengths $k=3$ and $k=5$.
The collection of occurrences of the base features is calculated as
$${\mathrm{\Phi}}_{r}(R,z)=\sum_{j=1}^{\left|R\right|}{\mathbf{\varphi}}_{r}(R,z,j),$$
where ${\mathbf{\varphi}}_{r}(R,z,j)$ is a vector whose elements are the base features of the $j$th base mentioned above.
Residue–base contact features, which are summarized in Table 4, describe the binding affinity between the local contexts of amino acids and ribonucleotides. For this purpose, we employ combinations of the residue features and the base features mentioned above. For example, for each pair of $k$-mers of amino acids ${p}_{kmer}$ and ribonucleotides ${r}_{kmer}$, we define a binary-valued local feature of the $i$th residue and the $j$th base:
$${\varphi}_{{p}_{kmer},{r}_{kmer}}(P,R,z,i,j)=I\left(kmer(P,i)={p}_{kmer}\right)\,I\left(kmer(R,j)={r}_{kmer}\right)\,{z}_{ij}.$$
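The collection of occurrences of the residue–base contact features is calculated as
$${\mathrm{\Phi}}_{c}(P,R,z)=\sum_{i=1}^{\left|P\right|}\sum_{j=1}^{\left|R\right|}{\mathbf{\varphi}}_{c}(P,R,z,i,j),$$
where ${\mathbf{\varphi}}_{c}(P,R,z,i,j)$ is a vector whose elements are the residue–base contact features of the $i$th residue and the $j$th base mentioned above.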
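Contact features pair a residue context with a base context wherever the contact map has ${z}_{ij}=1$. The sketch below counts such pairs for plain amino acid and ribonucleotide k-mers only. Using the same $k$ on both sides, the `(p_kmer, r_kmer)` key format, and skipping contacts whose windows overrun a sequence end are assumptions; Table 4 defines the combinations actually used.

```python
from collections import Counter
from typing import List, Optional


def centered_kmer(seq: str, idx: int, k: int) -> Optional[str]:
    """k-mer centered on the 0-based index idx, or None near the ends."""
    half = (k - 1) // 2
    lo, hi = idx - half, idx + half + 1
    if lo < 0 or hi > len(seq):
        return None
    return seq[lo:hi]


def contact_feature_counts(
    P: str, R: str, z: List[List[int]], k: int = 3
) -> Counter:
    """Count (residue k-mer, base k-mer) pairs over all contacts z[i][j] == 1."""
    counts: Counter = Counter()
    for i, row in enumerate(z):
        for j, z_ij in enumerate(row):
            if z_ij != 1:
                continue  # only contacting pairs contribute
            p_ctx = centered_kmer(P, i, k)
            r_ctx = centered_kmer(R, j, k)
            if p_ctx is None or r_ctx is None:
                continue  # assumed: incomplete windows are skipped
            counts[(p_ctx, r_ctx)] += 1
    return counts


# Toy example: residue 3 (R) contacts bases 2 and 3.
P, R = "MKRAG", "GAUCA"
z = [[0] * 5 for _ in range(5)]
z[2][1] = z[2][2] = 1
print(contact_feature_counts(P, R, z))
```

Unlike the residue and base features, which depend on $z$ only through the binding-site indicators, these counts depend on the individual entries ${z}_{ij}$.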
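The notation $\mathrm{\Phi}(P,R,z)$ denotes the feature representation of protein–RNA pair $(P,R)$ and its residue–base contact map $z\in \mathcal{CM}(P,R)$, that is, the collection of occurrences of local features in $P$, $R$, and $z$ defined as follows:
$$\mathrm{\Phi}(P,R,z)=\begin{pmatrix}\sum_{i=1}^{\left|P\right|}{\mathbf{\varphi}}_{p}(P,z,i)\\ \sum_{j=1}^{\left|R\right|}{\mathbf{\varphi}}_{r}(R,z,j)\\ \sum_{i=1}^{\left|P\right|}\sum_{j=1}^{\left|R\right|}{\mathbf{\varphi}}_{c}(P,R,z,i,j)\end{pmatrix}.$$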
Each feature in $\mathrm{\Phi}$ is associated with a corresponding parameter, and the score for the feature is defined as the value of the occurrence multiplied by the corresponding parameter. We define the scoring model $f(P,R,z)$ as a linear function
$$f(P,R,z)=\langle \mathbf{\lambda},\mathrm{\Phi}(P,R,z)\rangle ,$$
where $\langle \cdot ,\cdot \rangle $ is the inner product and $\mathbf{\lambda}={({\mathbf{\lambda}}_{p}^{\top},{\mathbf{\lambda}}_{r}^{\top},{\mathbf{\lambda}}_{c}^{\top})}^{\top}$ is the corresponding parameter vector trained with training data as described in Section 2.4.
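Because the scoring model is linear, evaluating it for a fixed contact map reduces to a sparse inner product between feature counts and parameters. A minimal sketch follows, assuming features are stored as a mapping from feature name to count; the feature names and weight values below are hypothetical and are not trained parameters from this work.

```python
from typing import Dict, Hashable


def score(phi: Dict[Hashable, float], lam: Dict[Hashable, float]) -> float:
    """Inner product of sparse feature counts and parameters.

    Features missing from lam are treated as having weight 0
    (an assumption convenient for sparse storage).
    """
    return sum(count * lam.get(name, 0.0) for name, count in phi.items())


# Hypothetical feature counts for one candidate contact map.
phi = {("res", "AKR"): 2, ("res", "RAKRA"): 1, ("base", "GAU"): 1}
# Hypothetical parameter values, for illustration only.
lam = {("res", "AKR"): 0.5, ("res", "RAKRA"): -0.25, ("contact", "KRA", "GAU"): 1.0}

print(score(phi, lam))  # 2 * 0.5 + 1 * (-0.25) + 1 * 0.0 = 0.75
```

Prediction then consists of searching over contact maps $z\in \mathcal{CM}(P,R)$ for the one with the highest score, which the text formulates as an integer programming problem rather than by enumeration.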