RPITER: A Hierarchical Deep Learning Framework for ncRNA–Protein Interaction Prediction

Non-coding RNAs (ncRNAs) play crucial roles in multiple fundamental biological processes, such as post-transcriptional gene regulation, and are implicated in many complex human diseases. Mostly ncRNAs function by interacting with corresponding RNA-binding proteins. The research on ncRNA–protein interaction is the key to understanding the function of ncRNA. However, the biological experiment techniques for identifying RNA–protein interactions (RPIs) are currently still expensive and time-consuming. Due to the complex molecular mechanism of ncRNA–protein interaction and the lack of conservation for ncRNA, especially for long ncRNA (lncRNA), the prediction of ncRNA–protein interaction is still a challenge. Deep learning-based models have become the state-of-the-art in a range of biological sequence analysis problems due to their strong power of feature learning. In this study, we proposed a hierarchical deep learning framework RPITER to predict RNA–protein interaction. For sequence coding, we improved the conjoint triad feature (CTF) coding method by complementing more primary sequence information and adding sequence structure information. For model design, RPITER employed two basic neural network architectures of convolution neural network (CNN) and stacked auto-encoder (SAE). Comprehensive experiments were performed on five benchmark datasets from PDB and NPInter databases to analyze and compare the performances of different sequence coding methods and prediction models. We found that CNN and SAE deep learning architectures have powerful fitting abilities for the k-mer features of RNA and protein sequence. The improved CTF coding method showed performance gain compared with the original CTF method. Moreover, our designed RPITER performed well in predicting RNA–protein interaction (RPI) and could outperform most of the previous methods. On five widely used RPI datasets, RPI369, RPI488, RPI1807, RPI2241 and NPInter, RPITER obtained AUC of 0.821, 0.911, 0.990, 0.957 and 0.985, respectively. The proposed RPITER could be a complementary method for predicting RPI and constructing RPI network, which would help push forward the related biological research on ncRNAs and lncRNAs.


Introduction
In human genome, protein-coding genes only account for about 2% and the vast majority are non-coding RNAs (ncRNAs), which directly function at the RNA level [1,2]. Based on transcript lengths, ncRNAs could be roughly divided into small ncRNAs below 200 nucleotides, such as siRNAs, miRNAs, and piRNAs, and long non-coding RNAs (lncRNAs) over 200 nucleotides [3][4][5][6][7]. Recently, increasing evidence suggests that lncRNAs play a crucial role in the physiological processes, such as epigenetic regulation of gene expression [8], cell cycle regulation [9], and chromatin modification [10], as well as pathological processes, such as cancer [11][12][13], diabetes [14], and Alzheimer's disease [15]. protein sequence encoding methods and discussed how to use deep learning techniques including CNN, LSTM and attention mechanism to tackle the biological sequence problems such as prediction of subcellular localization and protein secondary structure. Xu et al. introduced a novel DNA sequence encoding method by training k-mer embedding with Glove [38], a word embedding approach in natural language processing (NLP) field, and then predicted chromatin accessibility based on the proposed encoding method and via CNN and LSTM network. In view of the state-of-the-art performance of deep learning-based methods on all kinds of biological sequence problems, it is feasible and valuable to develop the specific deep learning model to implement the RPI prediction.
Deep learning allows computational models that are composed of multiple processing layers to automatically discover representations of data with multiple levels of abstraction, which is one major advantage compared with the domain-specific feature engineering essential for conventional machine learning methods [39]. CNN in deep learning is good at extracting features from input data, and multiple convolution and pooling layers enable the CNN network to form different levels of feature abstraction. When dealing with raw biological sequence data, CNN is suited to locate motifs. Recurrent neural network (RNN) in deep learning processes every step in the sequential input recurrently, thus is ideally applied to speech, text and biological sequence data. LSTM, a variant of RNN with the memory cell, is a popular architecture due to its effectiveness in learning the long-term sequence dependency. Word2vec [40], similar to Glove, is an efficient approach in NLP for computing numeric vector representations of words, word embedding, from raw text. If we regard the k-mer information of biological sequence data as the word in natural language, we can naturally train the k-mer embedding using the word2vec tool, which could provide a promising sequence encoding method. SAE is composed of multiple auto-encoders by layer-wise unsupervised learning [41]. It can automatically learn high-level features from raw data to form reduced dimensional representation, which was used by Sun et al. [42] to process protein sequence vector coded by conjoint triad feature method and auto-covariance method to finally predict protein-protein interactions.
In this study, we proposed a fully deep learning-based hierarchical model, RPITER, which utilizes the sequence and structure information of RNA and protein to predict the ncRNA-protein interactions. The whole proposed model consists of four modules for analyzing the inputs from two sequence coding parts and an ensemble module for integrating the outputs from basic modules to make a comprehensive prediction result. The whole architecture is shown in Figure 1. First, apply two sequence encoding methods to code the RNA and protein sequences with or without structure information; Second, use CNN and SAE-based modules to form high-level feature representations and output prediction results. Third, rely on ensemble module to combine the previous prediction results of four basic modules and further improve the prediction performance.

Results
Comprehensive experiments were performed to compare the performance of different sequence coding methods and RPI prediction methods. We employed six metrics to compare the method performance, namely accuracy (Acc), sensitivity (Sn), specificity (Sp), precision (Pre), Matthews correlation coefficient (MCC), and AUC (the area under the receiver operating characteristic curve (ROC)).

Performance Comparison between Different Sequence Coding Methods
We improved the sequence coding method CTF by complementing more primary sequence information and sequence structure information, noted as Improved CTF and Improved Struct CTF (described in detail in Section 4.2). The performances of three sequence coding methods in five-fold cross validation (CV) on dataset RPI2241 are shown in Table 1 and Figure 2a. CTF yielded an Acc of 0.848, Sn of 0.826, Sp of 0.869, Pre of 0.864, MCC of 0.697 and AUC of 0.929. Improved CTF and Improved Struct CTF achieved the same Acc of 0.852, 0.004 higher than the 0.848 obtained by CTF. For Sn, Sp, Pre, MCC and AUC, Improved CTF and Improved Struct CTF also achieved similar performances and showed small advantages over CTF. As shown in Figure 2a, the ROC curves of three methods were very close but the curves of Improved CTF (orange) and Improved Struct CTF (green) were slightly higher than the curve of CTF (blue).   Moreover, we tried three commonly used deep learning sequence coding methods: one hot, word2vec and doc2vec [43]. One hot coding encodes each element in sequence into a binary vector and then concatenates all binary vectors successively to form the final sequence coding representation. One hot coding is extensively used for sequence coding in the deep learning-based models for biological sequence analysis [27,31,37]. Word2vec coding views k-mer in biological sequence as word in natural language and encodes the biological sequence by concatenating the k-mer representation vectors successively. Doc2vec coding directly transforms a biological sequence into a fix length vector of its embedding representation. The sequence structure information has also been considered in these three kinds of coding methods (the coding methods with "Struct" in name in Figure 2b). The ROC and AUC of these coding methods on dataset RPI2241 in five-fold CV are shown in Figure 2b. On RPI2241, one hot coding is superior to word2vec coding for having higher AUC, and word2vec coding is better than doc2vec coding. The sequence structure information notably improved the performance of doc2vec coding. The highest AUC achieved by these three coding methods is 0.872 (one hot coding), but the CTF, Improved CTF and Improved Struct CTF coding methods obtained obviously better AUC of 0.929, 0.934 and 0.931, respectively.

Performance Comparison between Different Basic Prediction Models
RF and SVM were employed to build RPISeq's classifiers; IPMiner fed the RNA and protein features into RF to make predictions. Our proposed RPITER employed CNN and SAE architectures in four basic prediction modules to map input features into high-level feature representations and generate prediction results. RPISeq, IPMiner and RPITER all rely on the CTF form sequence coding features. Therefore, we compared the performances of different prediction models, namely RF, SVM, CNN and SAE, using same CTF coding features on datasets RPI1807, RPI2241 and NPInter. For SVM, we adopted the parameter setting in [21] (kernel = "polynomial", degree = 2, C = 1.0, and tolerance = 0.001). For RF, we set the number of decision trees to 100. The Acc of four basic prediction models on the three datasets in five-fold CV are shown in Figure 3. The detailed performances of these methods on dataset NPInter are listed in Table 2.

Performance Comparison between Different Modules of RPITER
Our whole model RPITER consist of four basic prediction modules, namely Conjoint-CNN, Conjoint-SAE, Conjoint-Struct-CNN, and Conjoint-Struct-SAE (described in detail in Section 4.4). The prediction accuracy of our four basic modules and whole architecture RPITER on five benchmark datasets by five-fold CV are shown in  Comparing the performance of modules using different architectures of CNN and SAE but the same sequence coding inputs, it was found that the CNN-based modules have advantages over the SAE-based modules when the training samples are sufficient. On the two smallest datasets RPI369 and RPI488, the Acc of Conjoint-CNN and Conjoint-Struct-CNN were slightly inferior to Conjoint-SAE and Conjoint-Struct-SAE. On the two medium size datasets RPI1807 and RPI2241 and the largest dataset NPInter, the Acc of Conjoint-CNN and Conjoint-Struct-CNN had advantages over Conjoint-SAE and Conjoint-Struct-SAE. On RPI2241, Conjoint-CNN and Conjoint-Struct-CNN obtained 0.885 and 0.879 Acc, respectively, which notably better than Conjoint-SAE (0.859) and Conjoint-Struct-SAE (0.857). On NPInter, Conjoint-CNN (0.953) and Conjoint-Struct-CNN (0.953) both yielded a 0.011 higher Acc than Conjoint-SAE (0.942) and Conjoint-Struct-SAE (0.942).
No individual module can always surpass all other modules on all datasets. The architecture of our proposed RPITER combines the advantages of four basic modules with different architectures and sequence coding methods, which can provide a more comprehensive RPI prediction result.

Performance Comparison with Other Previous RNA-Protein Interaction (RPI) Prediction Methods
To further evaluate our proposed RPITER, we also compared our method with other previous RPI prediction methods. Since catRAPID and RPI-Pred do not offer standalone packages, we could only compared our methods with RPISeq, lncPro and IPMiner. For RPISeq, Muppirala et al. [21] proposed RPISeq-RF and RPISeq-SVM using RF and SVM, respectively, but RPISeq-RF outperformed RPISeq-SVM on RPI369 and RPI2241 according to their reported performances. Thus, in this study, RPISeq-RF was chosen for comparison. lncPro only provided the prediction source code of a trained model on their dataset. Thus, we directly tested the performance of their model on our five benchmark datasets. For IPMiner, RPISeq-RF and RPITER, the same data pre-processing and k-mer counting manner were conducted under the same five-fold CV condition. The prediction accuracy of the above four methods on five datasets are shown in Figure 5, and the detailed performance are listed in Table 3.
As shown in Figure 5, on datasets RPI369 and RPI2241, RPITER had a notable accuracy increase over all other methods, and the standard deviations of accuracy in five-fold CV (gray lines on the top of accuracy columns) were obviously smaller than IPMiner and RPISeq-RF. On datasets RPI488 and NPInter, RPITER also performed better than RPISeq-RF and lncPro and had similar prediction accuracy as IPMiner. On dataset RPI1807, RPITER, IPMiner and RPISeq-RF achieved similar accuracy and all outperformed lncPro. Moreover, on datasets RPI369, RPI1807, RPI2241 and NPInter, RPITER yielded the highest Sn and AUC among all methods. It should be noted that the dataset used to train the lncPro model overlapped with RPI488 [26]. Thus, lncPro performed well on RPI488 but showed poor performances on other datasets. IPMiner employed complex stacked ensembling [26] technique to increase the performance over its basic predictors including RPISeq-RF. In implementation, each of the three basic predictors of IPMiner was trained three times by a three-fold cross validation to generate the training and testing data for the LR-based ensemble part. As shown in Figure 5, this complicated ensemble manner brought about noteworthy Acc increase over RPISeq-RF on datasets RPI488 (0.010), RPI2241 (0.010) and NPInter (0.014). Nonetheless, the ensemble manner was too time-consuming for deep learning-based models. In contrast, RPITER directly trained each basic module one time, and then concatenated the outputs of four basic modules and trained the whole architecture one time again to make RPI predictions.

Discussion
In this study, we proposed a hierarchical deep learning-based framework RPITER, which contains two sequence coding parts as inputs and involves two basic CNN and SAE network architectures to generate comprehensive prediction result.
For sequence coding, we adjusted the CTF coding methods by adding more primary sequence information and sequence structure information into the coding vectors. According to the performance comparison on dataset RPI2241, our Improved CTF and Improved Struct CTF showed advantages over the previous CTF coding method. Besides, we tried three commonly used deep learning sequence coding methods of one hot, word2vec and doc2vec, but they performed worse than CTF form coding methods in this RPI prediction problem and were abandoned in this study. The sequence structure used in our coding method was predicted by the software, thus might only contain limited information. We noticed that RPI-Pred achieved 93% Acc using the experimentally validated structure but only 83% Acc using the predicted sequence structure on dataset RPI1807 [25]. Thus, more experimentally validated sequence structure data would contribute to more effective RNA-protein interaction prediction.
For model design, we found CNN architecture has a powerful fitting ability for the k-mer features of protein and RNA sequence compared with SAE architecture and machine learning models RF and SVM. We infer that our CNN-based modules using convolution neural networks are more complicated than our SAE-based modules using simple fully-connected layers, thus the CNN-based modules have larger sample demands for effective training than the SAE-based modules. Nonetheless, when sufficient samples are available, our CNN-based architecture can perform better in feature extraction and high-level feature representation than the SAE-based architecture. In 2017, Sun et al. [42] employed SAE architecture to process the conjoint triad features of protein sequence for protein-protein interaction prediction. Their model would probably have a performance increase if the SAE architecture were replaced by the CNN architecture, since CNN is more effective.
The designed deep learning model RPITER could integrate the advantages of four different basic predict modules and provide a comprehensive RPI prediction result. Based on experiments on five benchmark datasets, the proposed RPITER showed good performance in predicting RPI compared with the previous methods.

Benchmark Datasets
We employed five benchmark datasets (Table 4) from previous studies to validate the proposed method. Datasets RPI369, RPI488, RPI1807 and RPI2241 are extracted from the RNA-protein interaction complexes in the RNA-protein interaction database PRIDB [44] or PDB [45]. These four datasets are constructed following the criterion of least atom distance: if there exist a protein atom and a RNA atom with the distance between being less than the specified distance threshold, then the protein and RNA become an interaction pair. Additionally, RPI488 is comprised of lncRNA-protein interaction pairs, that is, the RNAs are ncRNAs longer than 200 nt. In contrast, rather than judging interaction by the atom distances in RPI complexes, the dataset NPInter contains the experimentally verified interactions between ncRNAs, especially lncRNAs, and proteins from the NPInter2.0 database [46]. Because RPI369, RPI2241 and NPInter datasets lack non-interaction pairs to work as negative samples in training model, the same number of non-interaction pairs were generated by randomly pairing the RNAs and proteins in positive samples and further discarding similar known interaction pairs [21,26] (a randomly generated pair R1-P1 was discarded if there existed an interaction pair R2-P2 with R1 and R2 shared 80% sequence identity and P1 and P2 shared 40% sequence identity). The cd-hit and cd-hit-est programs in CD-HIT version 4.6.8 [47,48] were applied to cluster RNA and protein sequences with identity thresholds being 0.4 and 0.8, the smallest clustering cutoff values in CD-HIT tool for protein and RNA, respectively.  [46] In this study, the protein and RNA sequence data of RPI488 and NPInter were downloaded from https://github.com/xypan1232/IPMiner. The sequence data of RPI369, RPI1807 and RPI2241 were retrieved from PDB. The structure information of protein sequence used in our study was predicted by SOPMA [49]. We employed their web server (https://npsa-prabi.ibcp.fr/cgi-bin/npsa_automat. pl?page=npsa_sopma.html) to calculate the classical three state protein secondary structures: α-helix, β-sheet and coil. For RNA structure prediction, we used RNAfold program in ViennaRNA Package version 2.4.3 [50] to calculate the two state dot-bracket representation of RNA secondary structure with min free energy. The sequence data and predicted sequence structure data of five RPI datasets are organized and available at https://github.com/Pengeace/RPITER.

Sequence Coding Method
To input RNA and protein sequences into deep learning or conventional machine learning models, the sequence data must first be transformed into numerical representations. Because the RNA and protein sequence length vary in a large range (0-4000) in our datasets, the commonly used fixed length sequence encoding methods for deep learning models, such as one hot encoding, do not fit our problem. Therefore, we adopted and improved the CTF method, which counts k-mer frequency in sequence to form fixed length vector representation.
In previous CTF methods [21,26], In this study, when only considering the sequence information, we extended the range of k to 1-3 in the k-mer frequency coding process for protein, and extend the range of k to 1-4 in the k-mer frequency coding process for RNA. That is, for protein, we computed the 1-mer, 2-mer and 3-mer frequency information to form an extended coding vector with 399 (∑ 3 i=1 7 i ) elements. For RNA, we compute the 1-mer, 2-mer, 3-mer and 4-mer frequency information to form an extended coding vector with 340 (∑ 4 i=1 4 i ) elements. By adding more sequence information into its coding vector, we hoped to enhance the model input to finally improve the model prediction performance. This sequence coding method is referred as Improved CTF.
When taking the sequence structure information into consideration, we calculated the 1-3-mer frequency of protein secondary structure and the 1-4-mer frequency of RNA secondary structure to complement the sequence coding vectors. For protein, combining the 1-3-mer frequency of three kinds of secondary structure (α-helix, β-sheet and coil) with previous reduced seven-letter alphabet would generate the protein coding vector with 438 (∑ 3 i=1 3 i + 7 i ) elements. For RNA, integrating the 1-4-mer frequency of two kinds of secondary structure (dot and bracket) with four ribonucleotides would produce the RNA coding vector with 370 (∑ 4 i=1 2 i + 4 i ) elements. This sequence coding method is noted as Improved Struct CTF. Table 5 summaries the coding length of protein and RNA by the above three different sequence coding methods.

Performance Evaluation
In this study, we evaluated the performance of RPITER and other methods by six metrics, Acc, Sn, Sp, Pre, MCC, and AUC. The formulas of the first five measurements are as follows: where TP and TN mean the number of correctly predicted positive and negative samples, respectively; FP and FN denote the number of wrongly predicted positive and negative samples, respectively; and P and N represent the total number of positive and negative samples, respectively. All these evaluation metrics were calculated by five-fold CV and no overlap between training and testing data.

Model Design
We designed a deep-learning framework, RPITER (Figure 1), to tackle the RNA-protein interaction problem. Our two sequence coding methods are used in the two sequence coding parts: Improved CTF coding and Improved Struct CTF coding. After each coding part, two different network architectures, CNN and SAE, are employed in two modules to extract features from the input and form high-level representations. In each module, the protein and RNA coding vectors are analyzed by two similar parts separately to form respective sequence embedding representations. Finally, an ensemble module integrates the outputs from the four basic modules (Conjoint-CNN, Conjoint-SAE, Conjoint-Struct-CNN, and Conjoint-Struct-SAE) to form the whole architecture of RPITER. The source code of RPITER can be freely download from https://github.com/Pengeace/RPITER.
In module Conjoint-CNN and Conjoint-Struct-CNN, first, two similar sequence embedding parts using CNN analyze the RNA and protein input vectors separately and form two sequence embeddings. Then, a three-layer fully-connected part concatenates the two sequence embeddings as input and make the interaction prediction. Each sequence embedding part with CNN has three convolution layers with filter numbers being 45, 64 and 45, thus is denoted as Conv-45-64-45. Between two convolution layers, max-pooling layer is used to reduce the representation dimension and introduce an invariance to noises. After the last convolution layer in Conv-45-64-45, the output two-dimensional tensor is flattened and further serves as the input of a fully connection layer with 128 neurons, denoted as Dense-128. Then, the two sequence embedding representations of RNA and protein are output by the two Dense-128 layers separately. Finally, a three-layer fully-connected part with 128, 64 and 2 neurons in each layer, Dense-128-64-2, takes the previous two sequence embeddings and makes the interaction prediction. In Dense-128-64-2, the input of the first Dense layer is the concatenation of the protein and RNA embeddings, and the output of the last Dense layer is the prediction result, which is further integrated by the later ensemble module.
In module Conjoint-SAE and Conjoint-Struct-SAE, first, two similar sequence embedding parts using SAE analyze the RNA and protein input vectors separately and generate two sequence embeddings. Then, a three-layer fully-connected part concatenates the two sequence embeddings as input and makes the interaction prediction. Each sequence embedding part with SAE has three fully-connected layers with neuron numbers being 256, 128 and 64, thus is denoted as Dense-256-128-64. After dimension reduction and high-level feature abstraction by the two three-layer SAE parts, the sequence embedding representations of RNA and protein are output by the last layers of two Dense-384-256-128 parts. Finally, a three-layer fully-connected part Dense-128-64-2 concatenates the previous two sequence embeddings as input for its first layer and makes the interaction prediction for a specific RNA-protein pair at the third layer. Similar to previous CNN-based modules, the prediction results of the two SAE-based modules are further integrated by the later ensemble module.
The last ensemble module concatenates the prediction results of former two CNN modules and two SAE modules as its input tensor and generates a more comprehensive prediction result for a given RNA-protein pair. The ensemble module is designed as a three-layer architecture, Dense-16-8-2, with three fully-connected layers in which the neuron numbers are 16, 8, and 2, respectively.
The four basic modules and ensemble module use the Softmax [51] activation function at their last layers to make binary predictions, and use back-propagation algorithm [52] to minimize loss function of binary cross entropy. Two optimization methods, Adam [53] and stochastic gradient descent (SGD) [54], are employed successively to train each module, among which Adam first gives the module a quick converge and then SGD is used to fine tune the module after. In Conv-45-64-45, Dense-128-64-2, and Dense-16-8-2, batch normalization [55] is employed to reduce internal covariate shift and help train the designed deep network, and ReLU [56] activation function is used to speed up the supervised train process. During the unsupervised pre-training process of the three-layer SAE, its parameters are optimized by greedy layer-wise training. To avoid over-fitting problem, the techniques of dropout [57] and early stopping [58] are also used. After each training epoch on the training data, the Acc of a training module is recorded. If the prediction Acc this time is higher than the highest Acc record in previous epochs, then the whole module parameters are saved and the highest Acc record is updated. If the highest Acc record on training data has not been updated for specific epochs, the train process is stopped and the module loads the previous lastly saved weights and tests once on the test data. Our model was implemented by the Keras2.0 library (https://keras.io/).

Conflicts of Interest:
The authors declare no conflict of interest.