Long Noncoding RNA and Protein Interactions: From Experimental Results to Computational Models Based on Network Methods

Non-coding RNAs longer than 200 nucleotides are termed long non-coding RNAs (lncRNAs), and they have gained tremendous attention in recent decades. Many studies have confirmed that lncRNAs play an important role in post-transcriptional gene regulation; for example, lncRNAs affect the stability and translation of splicing factor proteins. Mutations and malfunctions of lncRNAs are closely related to human disorders. As lncRNAs interact with a variety of proteins, predicting lncRNA–protein interactions is an important way to explore the functions of lncRNAs in depth and to enrich their annotations. Experimental approaches for detecting lncRNA–protein interactions are expensive and time-consuming. Computational approaches to predict lncRNA–protein interactions can be grouped into two broad categories. The first category is based on sequence, structural information and physicochemical properties. The second category is based on network methods that fuse heterogeneous data to construct lncRNA-related heterogeneous networks. Network-based methods can capture the implicit feature information in the topological structure of biological heterogeneous networks containing lncRNAs, which is often ignored by sequence-based methods. In this paper, we summarize and discuss the materials, interaction score calculation algorithms, and advantages and disadvantages of state-of-the-art network-based algorithms for lncRNA–protein interaction prediction, to assist researchers in selecting a suitable method for acquiring more dependable results. All the related network data have also been collected and processed for the convenience of users, and are available at https://github.com/HAN-Siyu/APINet/.


Introduction
Long non-coding RNAs (lncRNAs) are non-protein-coding transcripts with a length of more than 200 nucleotides, which can regulate gene expression at different levels [1]. LncRNAs were first regarded as transcriptional noise, and it was later found that they play an important role in cell division, differentiation, metabolism and other physiological processes [2][3][4]. With the development of biotechnology and the emergence of computational models, there is now a great deal of evidence suggesting that lncRNAs are significant in diverse mechanisms and are involved in almost the whole process of cells from one division to the next [5,6], such as in transcriptional and post-transcriptional regulation. Network-based methods fuse heterogeneous data, which not only takes the external links between molecules into account but also mines the hidden topological structure information in heterogeneous networks.
Nowadays, network science is extensively used in biology and related fields. It provides many practical descriptions to characterize various biological systems [36] and the relationships between diseases and biological factors [37]. Network science has achieved remarkable results in various fields of bioinformatics, with rapid advances in disease gene prioritization [38], disease lncRNA prioritization [39][40][41], disease-related miRNA identification [42][43][44][45][46][47][48], disease metabolite prioritization [49] and drug-target interaction prediction [50][51][52]. In this paper, we focus on reviewing network-based methods that integrate heterogeneous data to predict lncRNA-protein interactions directly. The materials, interaction score calculation algorithms, and advantages and disadvantages of state-of-the-art network-based algorithms for lncRNA-protein interaction prediction are summarized and discussed to assist researchers in selecting a suitable method for acquiring more dependable results. This article is organized as follows. Section 2 summarizes the relevant databases used for analyzing lncRNA-protein interactions. Section 3 gives a brief introduction to experimental approaches and machine learning-based computational approaches for studying lncRNA-protein interactions. Section 4 systematically analyzes biological network-based computational models for lncRNA-protein interaction prediction. Section 5 compares the performance of different network-based models for lncRNA-protein interaction prediction. Finally, Section 6 briefly summarizes the discussion in this paper and looks ahead to feasible future methods.

A Brief Introduction to the Relevant Databases Used for Analyzing LncRNA-Protein Interactions
The various databases discussed in this article incorporate lncRNAs from different tissues and focus on lncRNAs as well as lncRNA-related interactions. Some of these databases are available at RNAcentral [53]. Although these databases overlap to a large extent, each nonetheless offers considerable unique features. We present herein an overview of their respective contents and search features so that researchers can quickly see what each can offer. We then give a brief summary of the relevant databases in Table 1, including the name and website of each database and a brief description. We provide data on all possible interactions between biomolecules that may be used in research on lncRNA functions (which users can browse and download from https://github.com/HAN-Siyu/APINet/), that is, lncRNA-disease associations, lncRNA-lncRNA interactions, lncRNA-microRNA interactions, lncRNA-gene interactions, lncRNA-Gene Ontology (GO) interactions, microRNA-microRNA interactions, microRNA-disease associations, microRNA-gene interactions, microRNA-target interactions, gene-gene interactions, gene-metabolite interactions, metabolite-metabolite interactions, gene-GO interactions, gene-disease associations, gene-drug associations, metabolite-disease associations, drug-disease associations, drug-drug interactions, drug-side-effect interactions and disease-disease interactions. The details of these data are shown in Table 2. As some interaction data are integrated from multiple sources, Table 2 lists the types of interaction data, the number of interaction records for each pair of biomolecule types and the sources of these data. Together, these determine which association data can be used to construct heterogeneous networks, i.e., the composition of the heterogeneous networks.

LncRNA-Protein Interactions: From Experimental Approaches to Computational Models Based on High-Throughput Experiments
Several large-scale experimental approaches are available for detecting lncRNA-protein interactions, including RNA immunoprecipitation (RIP) followed by mass spectrometry analysis, RNAcompete [18], RIP-Chip [19], high-throughput sequencing of RNA isolated by crosslinking immunoprecipitation (HITS-CLIP) [20], and photoactivatable ribonucleoside-enhanced crosslinking and immunoprecipitation (PAR-CLIP) [21]. Although these approaches can provide valuable data to construct a network of lncRNA-protein interactions, they are expensive and time-consuming, disadvantages that cannot be ignored. Computational approaches are therefore urgently needed.
There are many effective methods for the analysis of experimental data, including several methods for finding RNA motifs from CLIP-Seq or other high-throughput experiments, such as BEAM [22] and SMARTIV [23]. BEAM is a method for structural motif discovery from a set of unaligned RNAs. Tested in various scenarios, BEAM is successful in retrieving structural motifs even in highly noisy datasets, such as those that can arise in CLIP-Seq or other high-throughput experiments. Previous methods could not provide information about the structure preferences of proteins; now that RNA structure information can be obtained, both the sequence and the structure preferences of RNA-binding proteins can be inferred. SMARTIV is a novel computational tool for discovering combined sequence and structure binding motifs from in vivo RNA binding data, relying on the sequences of the target sites, the ranking of their binding scores and their predicted secondary structures. The combined motifs are presented in a unified form, which is rich in information and easy to interpret visually. These high-throughput experimental data can then be used to develop machine learning methods for prediction. The quality of these models depends directly on the experimental data. At present, the NPInter and StarBase databases, which are constructed from high-throughput experimental data, are the existing databases for lncRNA-protein interactions.

LncRNA-Protein Interactions: From Experimental Results to Computational Models Based on Machine Learning
Computational approaches for lncRNA-protein interaction prediction can be grouped into the following two categories. The first category is based on sequence and structural information and physicochemical properties, including RPISeq [24], de novo prediction [25], CatRAPID [26], LncPro [5], RPI-Pred [27], rpiCOOL [28], IPMiner [29] and lncADeep [30]. The second category is based on the fusion of heterogeneous data to construct a network, such as the lncRNA-protein bipartite network inference (LPBNI) method [31], the method that fuses multiple protein-protein similarity networks (PPSNs) [32], the relevance search-based method for predicting lncRNA-protein interactions proposed by Yang et al. [33], the method for predicting interactions between lncRNAs and proteins on heterogeneous networks (LPIHN) [34] and the method for predicting lncRNA-protein interactions using HeteSim scores (PLPIHS) [35].
Various classical methods have been proposed that rely on intrinsic characteristics such as sequence information. RPISeq [24] predicts RNA-protein interactions using sequence information only. It uses two supervised machine learning algorithms, the support vector machine (SVM) classifier and the random forest (RF) classifier. De novo prediction of RNA-protein interactions [25] also considers only sequence information. A set of known RNA-protein interactions is collected as gold-standard positives, and sequence-based features are extracted for each RNA-protein pair [25]. These features are then used to train a Bayes classifier as the RNA-protein interaction prediction model. CatRAPID [26] uses physicochemical properties, including the secondary structures of the molecules and their propensities for hydrogen bonding and van der Waals interactions. The protein-RNA pairs are first encoded into feature vectors, and the interaction score is then calculated through matrix computation. LncPro [5] predicts ncRNA-protein interactions by using Fisher's linear discriminant approach. The training features come not only from protein secondary structures and their propensities for hydrogen bonding and van der Waals interactions, but also from RNA secondary structures [93]. LncPro likewise builds a matrix for each pair and, with a simple machine-learning model, calculates through matrix computation an interaction score that represents the degree of interaction. RPI-Pred [27], an SVM-based machine-learning approach, considers sequence features and combines them with the high-order structures of both proteins and RNAs. This approach uses protein blocks rather than the classical three-state protein secondary structures, and five classes of RNA secondary structures are regarded as high-order structures.
RpiCOOL [28] is a tool developed for detecting RNA-protein interactions in silico by using the RF classifier, which classifies RNA-protein pairs as interacting or non-interacting. The sequence composition and repetitive patterns of the protein and RNA are used as features, which are encoded into feature vectors that represent each RNA-protein pair. IPMiner [29], a tool based on simple sequence composition features, integrates deep neural networks and stacked ensembling classifiers to identify RNA-protein interactions. The extracted original features and the features learned by SDA (stacked denoising autoencoder) and SDA-FT (SDA with fine tuning) are each provided to an RF classifier. The outputs of these three classifiers are then integrated by stacked ensembling, in which a logistic regression model is trained on them. These computational methods help to close the widening gap between raw and annotated data that has resulted from the large amount of data produced by high-throughput technologies. LncADeep [30] predicts lncRNA-protein interactions based on deep neural networks, using both sequence and structure information.
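Several of the methods above start from sequence composition. Purely as an illustration, the following Python sketch turns an RNA sequence into a k-mer frequency vector. The function name `rna_kmer_frequencies` and the choice k = 3 are assumptions made for this example; they are not taken from any of the cited tools, whose actual encodings differ in detail.

```python
from itertools import product

def rna_kmer_frequencies(sequence, k=3):
    """Return the normalized k-mer composition of an RNA sequence.

    Hypothetical helper for illustration; windows containing characters
    outside A/C/G/U (e.g. N) are skipped.
    """
    kmers = ["".join(p) for p in product("ACGU", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    seq = sequence.upper().replace("T", "U")  # accept DNA-style input
    total = 0
    for start in range(len(seq) - k + 1):
        window = seq[start:start + k]
        if window in counts:
            counts[window] += 1
            total += 1
    # Fixed ordering of k-mers gives a fixed-length feature vector (4**k entries).
    return [counts[km] / total if total else 0.0 for km in kmers]

features = rna_kmer_frequencies("AUGGCUA")
```

A vector of this kind, concatenated with an analogous protein vector, is the sort of input that an SVM or RF classifier can be trained on.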
Compared with computational approaches, experimental methods for detecting lncRNA-protein interactions suffer from major disadvantages, such as high cost and long turnaround time. The intrinsic features of lncRNAs and proteins have therefore attracted increasing interest from researchers, and their advantages have already been demonstrated in research on lncRNA identification. Methods for lncRNA-protein interaction prediction focus on intrinsic features of lncRNAs and proteins, such as sequence information, structure information, and physicochemical properties, including hydrogen-bond and van der Waals propensities. We analyzed the methods according to the kind of information they use (sequence, structure and physicochemical information) and the machine learning algorithms they employ. The comparison of the methods is shown in Table 3. In this article, we give a more detailed introduction to each computational model for lncRNA-protein interaction prediction based on intrinsic features of lncRNAs and proteins. To make these computational models easier to use, we have added information on the availability of each method, such as the web server or offline package for lncRNA-protein interaction prediction based on sequence and structural information and physicochemical properties.
Whereas the machine learning-based methods only consider the properties of the RNAs or proteins and neglect the known interactions between lncRNAs and proteins, the network-based methods pay more attention to these interactions, which are implied by the topology between nodes in the heterogeneous networks of lncRNAs. In addition, computational models based on machine learning are affected to some extent when the sequence is too long or when the predicted structural information contains randomness.

Computational Models for LncRNA-Protein Interaction Prediction Based on Biological Networks
The previously described methods for predicting the interactions between lncRNAs and proteins focus on the intrinsic features of lncRNAs and proteins but do not take the topological structures of the biological networks associated with lncRNAs into consideration. Nowadays, network science is used extensively in biology and related fields, and it provides many practical descriptions of biological systems and of the relationships between diseases and biomolecules as biological factors [33]. Known lncRNA-protein interaction networks, lncRNA-lncRNA similarity networks and PPI networks, which are downloaded from databases or obtained by fusing multiple PPSNs, can be integrated to construct heterogeneous networks. A model that computes node similarity across these networks, such as a random walk on the heterogeneous network or another propagation algorithm that can discover potential associations, is then used to discover possible interactions between lncRNAs and proteins. The overview is presented in Figure 1. We analyzed the differences among the network-based methods, namely which heterogeneous datasets each method uses, how the heterogeneous data are fused to construct the network, and which algorithms are used to compute the interactions on the heterogeneous network. The differences among the network-based methods are shown in Table 4. In this article, we give a more detailed introduction to each computational model for lncRNA-protein interaction prediction based on biological networks.
Table 3. The comparison of each method by analyzing the differences in intrinsic features and classifiers.
[Table 3 body: the methods compared are CatRAPID [26], RPISeq [24], De novo [25], LncPro [5], RPI-Pred [27], rpiCOOL [28], IPMiner [29] and lncADeep [30]; the rows list features (e.g., RNA secondary structure, protein secondary structure) and classifiers.]
Table notes: Bold indicates the best AUC value. The performance of a method is better when the heterogeneous network is composed from more sources. When heterogeneous networks are constructed from the same sources, networks built from weighted networks perform better. 1 https://github.com/USTC-HIlab/LPBNI (offline package); 2 https://github.com/cyang235/LncADeep (offline package); 3 lncRNA-protein interactions; 4 protein-protein interactions; 5 lncRNA-lncRNA interactions; 6 A relevance search based on random walk in a heterogeneous network evaluates the relevance between a lncRNA and a protein, and a large relevance score means a high possibility that the lncRNA and protein interact [94]; 7 Similarity Network Fusion: a nonlinear message-passing-based method that iteratively updates each network and makes it more and more similar to the others [95].
Figure 1. Overview of five computational models for lncRNA-protein interaction prediction based on network methods, including data collection and core algorithm. Illustration: The specific algorithm implementation of each method is enclosed in a dotted rectangular box of a different color, and the solid lines of different colors outside the dotted boxes represent the data sources used by the different methods. These colors are the same as those used for the method names. In addition, the colors of the solid lines inside the dotted boxes distinguish lncRNA-lncRNA, protein-protein and lncRNA-protein interactions.

LPBNI: A Bipartite Network-Based Method for the Prediction of LncRNA-Protein Interactions
LPBNI is inspired by the resource-allocation process on bipartite networks proposed by Zhou et al. [96], in which information propagates between the two sides of the network. Ge et al. [31] developed this method on the basis of an lncRNA-protein bipartite network to predict lncRNA-protein interactions. The lncRNA-protein interaction network can be represented by a graph $G(L, P, E)$, where $L$ is the set of $n$ lncRNAs, $P$ is the set of $m$ proteins and $E$ is the set of edges between them. The structure of the lncRNA-protein bipartite network is illustrated in Figure 2. The propagation method is applied to the constructed network, and the degree of each lncRNA-protein interaction is calculated as a score. In $G(L, P, E)$, the propagation matrix is denoted as $W$, where $W_{ik}$ represents the information transferred from node $p_k$ to node $p_i$; the information transmitted between two nodes reflects the importance of the nodes. For each lncRNA $l_j$, they defined $S_0(i) = s_{ij}$, $i \in \{1, 2, \ldots, m\}$, as the initial information on the proteins $P$, where $s_{ij} = 1$ if $p_i$ interacts with $l_j$; otherwise, $s_{ij} = 0$. $S_L(l_j)$, $j \in \{1, 2, \ldots, n\}$, represents the score on $l_j$ after the first step of information propagation, which can be calculated as
$$S_L(l_j) = \sum_{i=1}^{m} \frac{a_{ij} S_0(i)}{d(p_i)}$$
In the formula above, $a_{ij}$ is the entry of the adjacency matrix $A$ of the bipartite network ($a_{ij} = 1$ if $p_i$ interacts with $l_j$, and 0 otherwise), and $d(p_i) = \sum_{j=1}^{n} a_{ij}$ is the number of lncRNAs that interact with $p_i$.

[Figure 2, panel (3): two-step propagation of LPBNI in the bipartite network for lncRNA-protein interaction prediction.]
If there are n lncRNAs related to a protein, the weight of the protein is divided into n parts, so the weight passed from this protein to each of its lncRNAs is 1/n. In the next step, all the information in $L$ propagates back to $P$. $S_F(p_i)$ is defined as the final information on protein $p_i$, signifying the interaction score of protein $p_i$ with $l_j$. $S_F$ can be defined as
$$S_F(p_i) = \sum_{j=1}^{n} \frac{a_{ij} S_L(l_j)}{d(l_j)} = \sum_{j=1}^{n} \frac{a_{ij}}{d(l_j)} \sum_{k=1}^{m} \frac{a_{kj} S_0(k)}{d(p_k)}$$
where $d(l_j) = \sum_{i=1}^{m} a_{ij}$ is the number of proteins that interact with $l_j$. The final information $S_F$ can be written in matrix form as
$$S_F = W S_0$$
where $S_0$ is the column vector of initial information, and $S_F$ is the vector of final scores for the queried lncRNA after the two-step information propagation in the lncRNA-protein interaction network. The entries of $W$ can be represented as
$$W_{ik} = \frac{1}{d(p_k)} \sum_{j=1}^{n} \frac{a_{ij} a_{kj}}{d(l_j)}$$
After these calculations, the proteins are sorted by the final score $S_F$ for $l_j$. All the candidate proteins are ranked in decreasing order, and proteins with a high ranking are considered to interact with lncRNA $l_j$.
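The two-step propagation can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the LPBNI package itself: the function name `lpbni_scores` and the toy adjacency matrix are hypothetical, and the zero-degree guard is added here only to avoid division by zero.

```python
import numpy as np

def lpbni_scores(A):
    """Two-step resource propagation on a protein x lncRNA bipartite network.

    A is an (m proteins) x (n lncRNAs) 0/1 adjacency matrix.
    Column j of the result holds the final scores S_F for lncRNA l_j.
    """
    A = np.asarray(A, dtype=float)
    d_p = A.sum(axis=1)  # d(p_i): number of lncRNAs linked to each protein
    d_l = A.sum(axis=0)  # d(l_j): number of proteins linked to each lncRNA
    # Reciprocal degrees; isolated nodes get 0 instead of a division by zero.
    inv_p = np.divide(1.0, d_p, out=np.zeros_like(d_p), where=d_p > 0)
    inv_l = np.divide(1.0, d_l, out=np.zeros_like(d_l), where=d_l > 0)
    # W[i, k] = (1 / d(p_k)) * sum_j A[i, j] * A[k, j] / d(l_j)
    W = ((A * inv_l) @ A.T) * inv_p
    # Column j of A is S_0 for lncRNA l_j, so W @ A scores every lncRNA at once.
    return W @ A

# Toy network: 3 proteins (rows) x 3 lncRNAs (columns).
A = np.array([[1, 1, 0],
              [0, 1, 1],
              [1, 0, 0]])
scores = lpbni_scores(A)
ranking_for_l0 = np.argsort(-scores[:, 0])  # proteins ordered by score for l_0
```

Because each step only redistributes information, the total score in each column equals the amount of initial information placed on the proteins for that lncRNA, which is a convenient sanity check.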
Leave-one-out cross-validation (LOOCV) was performed on the heterogeneous network containing lncRNA-protein interactions, leaving only one sample for the test set at a time and using the other samples as the training set. Although the calculation was more expensive than other verification methods, the sample utilization rate was the highest. LOOCV was used to evaluate the performance of the proposed method. In the course of the calculation, each lncRNA-protein pair was omitted in turn as a test sample by changing its value in the adjacency matrix A to 0. The performance of LPBNI could be estimated by the ratio of its predicted interactions to the originally known lncRNA-protein interactions. A receiver operating characteristic (ROC) curve was selected as the criterion to evaluate the LPBNI and random walk with restart methods. The propagation matrix W proposed in the LPBNI method depends on the adjacency matrix A of the bipartite network. When applying LOOCV, multiple values of W were obtained, owing to the change of A during each step of the cross-validation. Consequently, W was recalculated for each lncRNA-protein pair that was left out as a test sample. In addition, nodes that do not propagate information, defined as nodes with fewer than two links during cross-validation, are not considered when evaluating the performance of the method.
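The LOOCV procedure described above can be sketched as follows. This is an illustrative outline only: `propagate` and `loocv_ranks` are hypothetical names, the toy matrix is invented, and ranking the held-out protein among the proteins not linked to the lncRNA in the training network is one plausible way to score a fold, assumed here rather than taken from the original evaluation.

```python
import numpy as np

def propagate(A):
    """Two-step propagation scores (proteins x lncRNAs) for adjacency A."""
    d_p = np.maximum(A.sum(axis=1), 1.0)  # degrees, floored at 1 to avoid 0/0
    d_l = np.maximum(A.sum(axis=0), 1.0)
    W = ((A / d_l) @ A.T) / d_p
    return W @ A

def loocv_ranks(A):
    """Leave each known pair out in turn and rank the held-out protein."""
    A = np.asarray(A, dtype=float)
    results = []
    for i, j in zip(*np.nonzero(A)):
        # Skip nodes with fewer than two links: they cannot propagate
        # information once the test edge is removed.
        if A[i].sum() < 2 or A[:, j].sum() < 2:
            continue
        held_out = A.copy()
        held_out[i, j] = 0.0                    # hide the test interaction
        scores = propagate(held_out)[:, j]      # W is recomputed for every fold
        candidates = np.flatnonzero(held_out[:, j] == 0)
        rank = 1 + int(np.sum(scores[candidates] > scores[i]))
        results.append((int(i), int(j), rank, len(candidates)))
    return results

A = np.array([[1, 1, 0],
              [1, 1, 1],
              [0, 1, 1]])
folds = loocv_ranks(A)
```

Sweeping a rank threshold over the collected `(rank, number of candidates)` pairs is what produces the points of an ROC curve.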

Fusing Multiple Protein-Protein Similarity Networks to Effectively Predict LncRNA-Protein Interactions
To improve the performance of lncRNA-protein interaction prediction, Zheng et al. [32] fused multiple PPSNs to construct a multilevel heterogeneous network. New lncRNA-protein interactions are inferred by integrating the fused PPSN with known lncRNA-protein interactions (Figure 3). Protein sequences, protein domains, GO terms, and the STRING database are first used to construct four PPSNs, following which the similarity network fusion (SNF) algorithm [95] is employed to combine the four PPSNs into one fused PPSN. Then, a heterogeneous lncRNA-protein network is built from the fused PPSN and the known lncRNA-protein interactions. Finally, the HeteSim algorithm [94] is used to infer new lncRNA-protein interactions. Extensive experiments show that this approach outperforms not only the existing methods for predicting lncRNA-protein interactions, but also the variant that uses only a single PPSN as the protein-protein network instead of fusing the four. After fusing all four matrices, the area under the curve (AUC) reaches its best value of 0.9068. This result shows that a more reliable and informative network can be obtained by fusing multiple matrices. The advantage of the SNF algorithm is that it can obtain valuable information from a relatively small number of samples and is robust to noise and data heterogeneity. SNF is a nonlinear method based on message passing, and nonlinear methods are considered better suited to capturing the complexity of real-world data.
This method iteratively updates each network and makes it more and more similar to the other networks. A protein similarity network can be represented as a graph $G = (V, E)$, where $V = \{v_1, v_2, \ldots, v_n\}$ is the set of proteins in the network, and $E$ is the set of edges, each of which has a similarity weight. The authors denoted the corresponding similarity matrix as $W$, where $W(i, j)$ is the similarity between proteins $v_i$ and $v_j$. They defined a full kernel and a sparse kernel on each matrix in order to compute the fused network from the four protein similarity matrices. The full kernel is a normalized weight matrix $P = D^{-1} W$, where $D$ is a diagonal matrix with $D(i, i) = \sum_{j} W(i, j)$. Because $P$ involves the self-similarities on the diagonal entries of $W$, a better form for avoiding numerical instability is as follows [95]:
$$P(i, j) = \begin{cases} \dfrac{W(i, j)}{2 \sum_{k \neq i} W(i, k)}, & j \neq i \\ \dfrac{1}{2}, & j = i \end{cases}$$
The neighbors of protein $v_i$ are denoted as $N_i$, and k nearest neighbors (kNN) are used to measure local affinity as follows:
$$S(i, j) = \begin{cases} \dfrac{W(i, j)}{\sum_{k \in N_i} W(i, k)}, & j \in N_i \\ 0, & \text{otherwise} \end{cases}$$
A protein's similarities to its neighbors are assumed to be more reliable than its similarities to remote proteins, and similarities can be propagated to remote proteins through graph diffusion. Matrix $P$ carries all the information of the PPSN, whereas $S$ carries only the local similarity information of the network. The iterative computation then proceeds as follows:
$$P^{(i)}_{t+1} = S^{(i)} \times \left( \frac{\sum_{k \neq i} P^{(k)}_{t}}{m - 1} \right) \times \left( S^{(i)} \right)^{\top}, \quad i = 1, 2, \ldots, m$$
where $P^{(i)}_{t}$ is the $i$th similarity matrix after $t$ ($\geq 0$) iterations, $S^{(i)}$ is the kNN matrix of the $i$th similarity matrix or network, and $m$ is the number of PPSNs used. As $S$ is the kNN matrix of $P$, it contains the most important information of $P$ and also alleviates the effect of noise in $P$. In each iteration, each similarity matrix obtains more reliable information from the other similarity matrices and is updated accordingly.
After $t$ iterations, the fused network is represented by a fusion matrix, which is defined as follows:
$$P = \frac{1}{m} \sum_{i=1}^{m} P^{(i)}_{t}$$
Note that the authors normalized each matrix $P_t$ after every iteration so that each protein is more similar to itself than to other proteins and the matrix remains full rank. With the known lncRNA-protein interactions and the fused PPSN, they built a lncRNA-protein heterogeneous network, on which the random walk model HeteSim was used to infer new lncRNA-protein interactions. HeteSim evaluates the relevance between a lncRNA and a protein, where a large relevance score means that the lncRNA and protein are more likely to interact.
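The fusion procedure can be sketched compactly in NumPy. The code below is a simplified illustration under stated assumptions, not the reference SNF implementation: the function names, the neighbourhood size `k`, the iteration count and the random toy matrices are all hypothetical, and each protein's own diagonal entry is allowed to count among its k nearest neighbours.

```python
import numpy as np

def full_kernel(W):
    """P(i,j) = W(i,j) / (2 * sum_{k != i} W(i,k)) off the diagonal, 1/2 on it."""
    W = np.asarray(W, dtype=float).copy()
    np.fill_diagonal(W, 0.0)
    P = W / (2.0 * W.sum(axis=1, keepdims=True))
    np.fill_diagonal(P, 0.5)
    return P

def knn_kernel(W, k):
    """S keeps each row's k largest similarities and normalizes them to sum to 1."""
    W = np.asarray(W, dtype=float)
    S = np.zeros_like(W)
    for i in range(W.shape[0]):
        neighbours = np.argsort(W[i])[-k:]
        S[i, neighbours] = W[i, neighbours] / W[i, neighbours].sum()
    return S

def snf(similarities, k=3, iterations=20):
    """Fuse m similarity matrices by cross-network diffusion."""
    P = [full_kernel(W) for W in similarities]
    S = [knn_kernel(W, k) for W in similarities]
    m = len(P)
    for _ in range(iterations):
        updated = []
        for v in range(m):
            # Average of the other networks' current status matrices.
            others = sum(P[u] for u in range(m) if u != v) / (m - 1)
            # Diffuse through network v's local structure, then renormalize.
            updated.append(full_kernel(S[v] @ others @ S[v].T))
        P = updated
    return sum(P) / m

rng = np.random.default_rng(0)

def random_similarity(n=5):
    X = rng.random((n, n))
    W = (X + X.T) / 2.0
    np.fill_diagonal(W, 1.0)
    return W

fused = snf([random_similarity() for _ in range(3)], k=3, iterations=5)
```

Renormalizing with `full_kernel` after each update keeps every row summing to one with 1/2 on the diagonal, which mirrors the per-iteration normalization mentioned in the text.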
For this method, 15 settings made up of different combinations of the four similarity matrices (Seqs, Pfam, GO, and STRING) were implemented. Path selection is very important, since HeteSim is a path-constrained relevance measure. In the fusion work, the relevance path was chosen as lncRNA-protein-protein, which is the same as that used in the work of Yang et al. [33]. The experiments show that the AUC value improves as more matrices are merged. For example, the AUC value of GO + Pfam + STRING is 0.9066, which is larger than the AUC values of GO + Pfam, GO + STRING and Pfam + STRING. When all four protein similarity matrices were fused, the AUC achieved the best result of 0.9068. This shows that, as the number of fused matrices increases, more specific information about the protein similarity network is obtained. This multi-matrix fusion method is thus a convenient way to obtain a more reliable and informative data representation.
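The count of 15 settings follows from taking every non-empty subset of the four similarity matrices (4 single matrices, 6 pairs, 4 triples and 1 set of four). A short enumeration, with the source labels used only as strings, makes this explicit:

```python
from itertools import combinations

sources = ["Seqs", "Pfam", "GO", "STRING"]

# Every non-empty combination of the four protein similarity matrices.
settings = [
    combo
    for size in range(1, len(sources) + 1)
    for combo in combinations(sources, size)
]

triples = [combo for combo in settings if len(combo) == 3]
```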

Prediction of Interactions between lncRNA and Protein by Using Relevance Search in a Heterogeneous LncRNA-Protein Network
Yang et al. [33] sought to exploit the information hidden in the topology of biological networks containing lncRNAs. To this end, the HeteSim algorithm is used to measure the relevance between lncRNAs and proteins on a heterogeneous lncRNA-protein network, which integrates the known lncRNA-protein interaction network and PPSNs. Figure 4 shows a network model and the schema of the interaction network. The AUC of HeteSim for the lncRNA MALAT1 is 0.955. The performance of this network-assisted method shows that it can address a difficult problem: because lncRNAs are poorly conserved, it is hard for traditional methods, which generally use only the intrinsic features of the RNA and protein, to predict lncRNA-protein interactions. Their approach also demonstrates the tremendous value of network-based approaches in lncRNA-related fields, and has valuable implications for predicting interactions in heterogeneous networks constructed from biomolecules. In the HeteSim algorithm [94], relevance paths are defined. A relevance path $P$, denoted as $A_1 \xrightarrow{R_1} A_2 \xrightarrow{R_2} \cdots \xrightarrow{R_l} A_{l+1}$, is a path defined over the schema $T_G = (A, R)$. The relevance path expresses a composite relation $R = R_1 \circ R_2 \circ \cdots \circ R_l$ between node types $A_1$ and $A_{l+1}$, where $\circ$ denotes the composition operator of relations. For a given relevance path $R = R_1 \circ R_2 \circ \cdots \circ R_l$, HeteSim measures the similarity between two objects $x$ and $y$ ($x \in R_1.X$ and $y \in R_l.Y$, i.e., $x$ belongs to the source type of $R_1$ and $y$ to the target type of $R_l$) according to the relevance score:
$$\mathrm{HeteSim}(x, y \mid R_1 \circ R_2 \circ \cdots \circ R_l) = \frac{1}{|O(x|R_1)|\,|I(y|R_l)|} \sum_{i=1}^{|O(x|R_1)|} \sum_{j=1}^{|I(y|R_l)|} \mathrm{HeteSim}\left(O_i(x|R_1), I_j(y|R_l) \mid R_2 \circ \cdots \circ R_{l-1}\right)$$
where $O(x|R_1)$ represents the out-neighbors of $x$ based on relation $R_1$, and $I(y|R_l)$ represents the in-neighbors of $y$ based on relation $R_l$. Under this pairwise random walk model, $x$ and $y$ can be of different types or of the same type.
For an arbitrary relevance path $P = A_1 A_2 \cdots A_{l+1}$, the HeteSim relevance between any two objects $a \in A_1$ and $b \in A_{l+1}$ is the corresponding component of the score matrix $\mathrm{HeteSim}(A_1, A_{l+1} \mid P)$. The relatedness between $A_1$ and $A_{l+1}$ along the relevance path $P = A_1 A_2 \cdots A_{l+1}$ is defined as follows:
$$\mathrm{HeteSim}(A_1, A_{l+1} \mid P) = \mathrm{HeteSim}(A_1, A_{l+1} \mid P_L P_R) = U_{A_1 A_2} U_{A_2 A_3} \cdots U_{A_{mid-1} M} V_{M A_{mid+1}} \cdots V_{A_l A_{l+1}}$$
Based on the random walk model [37], $P$ is divided into two paths of equal length, $P_L$ and $P_R$, where $P_L = A_1 A_2 \cdots A_{mid-1} M$ and $P_R = M A_{mid+1} \cdots A_{l+1}$. The middle type $M$ depends on whether the length of $P$ is even or odd. If the length of $P$ is even, $M$ is the node type at the middle position of the path, i.e., one of the $A_i$. Otherwise, $M$ is an additionally defined middle type. For a symmetric path, $P_R$ is equal to $P_L^{-1}$. The transition probability matrix of $A_i \rightarrow A_j$, denoted as $U_{A_i A_j}$, is the adjacency matrix $W_{A_i A_j}$ normalized along its row vectors, and the matrix denoted as $V_{A_i A_j}$ is $W_{A_i A_j}$ normalized along its column vectors. It is easy to show that $U_{A_i A_j} = V_{A_j A_i}^{\top}$. Finally, the score between two objects is normalized to ensure that the relevance of an object to itself is 1. When the HeteSim algorithm is applied to the lncRNA-protein heterogeneous network, the lncRNA-protein-protein relevance path is considered. In this network, a group of data is randomly extracted from the measured data as the training dataset, and the rest of the data are used as the test dataset. The AUC of HeteSim achieved on the lncRNA-protein-protein path is 0.879.
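For the lncRNA-protein-protein path, the path length is even and the middle type is the protein layer, so the computation reduces to two row-normalized matrices and a cosine-style normalization. The sketch below illustrates this under stated assumptions: `hetesim_lpp`, `row_normalise` and the toy matrices are hypothetical, and the normalization by the product of vector norms is the form commonly used for HeteSim, assumed here rather than quoted from the text above.

```python
import numpy as np

def row_normalise(M):
    """Divide each row by its sum; all-zero rows stay zero."""
    M = np.asarray(M, dtype=float)
    sums = M.sum(axis=1, keepdims=True)
    return np.divide(M, sums, out=np.zeros_like(M), where=sums > 0)

def hetesim_lpp(W_lp, W_pp):
    """Normalized HeteSim scores along lncRNA -> protein -> protein.

    W_lp: lncRNA x protein interaction matrix.
    W_pp: protein x protein similarity (or interaction) matrix.
    Returns a lncRNA x protein relevance matrix.
    """
    # Probability of a lncRNA reaching each middle protein (left half-path).
    reach_left = row_normalise(W_lp)
    # Probability of a target protein reaching each middle protein when
    # walking the right half-path in reverse.
    reach_right = row_normalise(np.asarray(W_pp, dtype=float).T)
    raw = reach_left @ reach_right.T
    norms = (np.linalg.norm(reach_left, axis=1)[:, None]
             * np.linalg.norm(reach_right, axis=1)[None, :])
    return np.divide(raw, norms, out=np.zeros_like(raw), where=norms > 0)

W_lp = np.array([[1, 0, 0],
                 [0, 1, 1]])
W_pp = np.array([[0, 1, 1],
                 [1, 0, 1],
                 [1, 1, 0]])
relevance = hetesim_lpp(W_lp, W_pp)
```

In this toy network, the second lncRNA and the first protein reach exactly the same middle proteins with the same probabilities, so their normalized relevance is 1.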

LPIHN: LncRNA-Protein Interaction Prediction Based on Heterogeneous Network Models
LPIHN is based on the assumption that related lncRNAs tend to exhibit similar interaction patterns with proteins. On this basis, Li et al. [34] proposed the network-based computational method LPIHN for predicting new lncRNA-protein interactions. The LPIHN procedure is shown in Figure 5. A heterogeneous network is constructed by integrating a lncRNA-lncRNA similarity network based on expression data, a lncRNA-protein interaction network and a PPI network. The lncRNA-lncRNA similarity network is calculated as the Pearson's correlation coefficient [97][98][99][100][101][102] between the expression profiles of each pair of lncRNAs. The lncRNA-protein interaction network is constructed from NPInter, which was analyzed in detail and comprehensively by Shang et al. [103]. The protein-protein interaction network, taken from the STRING v9.1 database [104], does not come from a single source: its weighted interactions are derived from computational prediction methods, high-throughput experiments and text mining. A random walk with restart (RWR) is then performed over the heterogeneous network to infer new interactions between lncRNAs and proteins. In the RWR procedure [37], a walker starts at a source node with an initial probability, and in each step it either moves to a randomly selected direct neighbor or restarts at a source node with probability δ. Therefore, to perform a random walk on a heterogeneous network, the initial probability, the transition matrix and the restart probability must be determined from the information provided by the heterogeneous network.
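The expression-based lncRNA-lncRNA similarity can be illustrated with NumPy's `corrcoef`. The function name and the toy expression profiles are hypothetical, and how negative correlations are treated before the matrix is used as a network (for example, clipping or taking absolute values) is not specified here and is left open in this sketch.

```python
import numpy as np

def expression_similarity(profiles):
    """Pearson correlation between every pair of lncRNA expression profiles.

    profiles: rows are lncRNAs, columns are samples or tissues.
    """
    return np.corrcoef(np.asarray(profiles, dtype=float))

profiles = [[1.0, 2.0, 3.0, 4.0],   # lncRNA 0
            [2.0, 4.0, 6.0, 8.5],   # lncRNA 1, rises with lncRNA 0
            [4.0, 3.0, 2.0, 1.0]]   # lncRNA 2, exact mirror of lncRNA 0
similarity = expression_similarity(profiles)
```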
When potential proteins are predicted for lncRNA $l_i$, $Y_0$ represents the initial probability of the walker starting at each node: $l_i$ and the proteins that are known to interact with $l_i$ are assigned positive values, and the remaining nodes are assigned zero. In other words, the random walk begins either at $l_i$ or at a protein that interacts with $l_i$. $Y_t$ represents the relevance of $l_i$ to all other nodes at step $t$, and its $j$-th element corresponds to node $j$. $Y_{t+1}$ can be defined by the following equation:

$$Y_{t+1} = (1 - \delta)\, W^{T} Y_t + \delta\, Y_0$$

where δ ∈ (0, 1) represents the restart probability of the random walk, $W$ is the transition matrix and $Y_0$ is the initial probability of the random walk. For a given lncRNA $l_i$, $l_i$ is the seed node in the lncRNA network: the probability of vertex $l_i$ is 1 and all other elements in the lncRNA network are assigned zero, which forms the initial probability vector of the lncRNA network, $v_0$. When protein $p_j$ interacts with lncRNA $l_i$, $p_j$ becomes a seed node in the protein network. The initial probability vector of the protein network, $u_0$, is formed by assigning equal probabilities to the protein seed nodes. For the heterogeneous network, the initial probability is

$$Y_0 = \begin{bmatrix} (1 - \beta)\, u_0 \\ \beta\, v_0 \end{bmatrix}$$

The parameter β ∈ (0, 1) decides whether more weight is placed on the lncRNA network or on the protein network. When β = 0.5, neither side is favored, which means that the lncRNA-lncRNA similarity network and the PPI network are given the same weight. With β < 0.5, the random walk tends to return to the protein network. A transition matrix is needed to complete the random walk on the heterogeneous network. The authors defined

$$W = \begin{bmatrix} W_P & W_{PL} \\ W_{LP} & W_L \end{bmatrix}$$

as the transition matrix, where $W_P$ is the subnetwork transition matrix giving the probability of the random walker moving from one protein to another protein. $W_L$, for moves from one lncRNA to another lncRNA, is calculated in a similar way. $W_{PL}$ represents the probability of the random walker transiting from the protein network to the lncRNA network, and $W_{LP}$ represents the probability of transiting from the lncRNA network to the protein network. The authors defined γ as the probability of the random walker jumping from the protein network to the lncRNA network, and the same probability applies in the reverse direction. The entry $W_P(i, j)$, the probability of the random walker transiting from protein $p_i$ to $p_j$, depends on whether $\sum_k I(i, k) \neq 0$, where $I(i, k)$ indicates an interaction between $p_i$ and lncRNA $l_k$. The condition $\sum_k I(i, k) \neq 0$ means that $p_i$ binds to at least one lncRNA, so the walker can move from $p_i$ to the lncRNA-lncRNA similarity network with probability γ; in this case, the probability of $p_i$ transiting to $p_j$ is multiplied by $1 - \gamma$. The probability of transiting from lncRNA $l_i$ to $l_j$ is defined in the same manner. The probability of transiting from protein $p_i$ to lncRNA $l_j$ is nonzero only when $\sum_k I(i, k) \neq 0$, that is, when $p_i$ is bound to at least one lncRNA and the walker can move to the lncRNA-lncRNA network from $p_i$ with probability γ; under that condition, the probability of $p_i$ transiting to $l_j$ can be calculated further. The probability of transiting from lncRNA $l_i$ to protein $p_j$ is defined in a similar manner. Once the initial probability $Y_0$ and the transition matrix $W$ were defined, the RWR procedure [37] could be applied to the heterogeneous network. After multiple iterations, the change between $Y_t$ and $Y_{t+1}$ was less than $10^{-10}$, which meant that a stable probability had been reached.

In each cross-validation experiment, one known lncRNA-protein interaction was held out as the test case, while the rest were used as the training dataset. The LOOCV test showed that the approach achieved an AUC of 0.96. Some predicted interactions between lncRNAs and proteins have been confirmed in recent research studies and databases, indicating the strong ability of LPIHN to predict lncRNA-protein interactions.
The method successfully reconstructed known interactions, and the possible new interactions were evaluated. In particular, the authors used precision-recall curves and fold enrichment to measure performance, and the average fold enrichment over all test data was also used to evaluate the model.
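The random walk with restart on the heterogeneous network can be sketched as follows. This is an illustrative reading of the procedure described in this section, not the LPIHN code: the names `ppi`, `lnc_sim` and `inter`, the toy values, and the parameter settings are hypothetical, and the update rule is assumed to be the common form $Y_{t+1} = (1-\delta) W^{T} Y_t + \delta Y_0$ with row-wise transition probabilities. Nodes without any link inside their own subnetwork are not treated specially here.

```python
def split_rows(intra, cross, gamma):
    """For each node, return (within-network, cross-network) transition rows."""
    out = []
    for a, c in zip(intra, cross):
        sa, sc = sum(a), sum(c)
        jump = gamma if sc else 0.0      # jump only if a cross link exists
        out.append(([(1 - jump) * x / sa if sa else 0.0 for x in a],
                    [jump * x / sc if sc else 0.0 for x in c]))
    return out

def transition_matrix(ppi, lnc_sim, inter, gamma):
    """Assemble W = [[W_P, W_PL], [W_LP, W_L]]; nodes ordered proteins, then lncRNAs.

    inter[i][k] is nonzero when protein i interacts with lncRNA k.
    """
    inter_t = [list(col) for col in zip(*inter)]
    rows = [p + l for p, l in split_rows(ppi, inter, gamma)]         # [W_P | W_PL]
    rows += [p + l for l, p in split_rows(lnc_sim, inter_t, gamma)]  # [W_LP | W_L]
    return rows

def initial_vector(inter, lnc_index, beta):
    """Y_0 = [(1 - beta) * u_0, beta * v_0] for the query lncRNA."""
    seeds = [row[lnc_index] for row in inter]
    k = sum(1 for s in seeds if s)
    u0 = [1.0 / k if s else 0.0 for s in seeds]          # equal mass on seed proteins
    v0 = [1.0 if j == lnc_index else 0.0 for j in range(len(inter[0]))]
    return [(1 - beta) * x for x in u0] + [beta * x for x in v0]

def rwr(w, y0, delta, tol=1e-10, max_iter=10000):
    """Iterate Y_{t+1} = (1 - delta) * W^T Y_t + delta * Y_0 until the change < tol."""
    y, n = y0[:], len(y0)
    for _ in range(max_iter):
        nxt = [(1 - delta) * sum(w[i][j] * y[i] for i in range(n)) + delta * y0[j]
               for j in range(n)]
        if sum(abs(a - b) for a, b in zip(nxt, y)) < tol:
            return nxt
        y = nxt
    return y

# Toy network (made up): 3 proteins, 2 lncRNAs.
ppi = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
lnc_sim = [[0, 0.8], [0.8, 0]]
inter = [[1, 0], [0, 1], [0, 0]]           # p0-l0 and p1-l1 are known pairs

W = transition_matrix(ppi, lnc_sim, inter, gamma=0.5)
Y0 = initial_vector(inter, lnc_index=0, beta=0.5)
Y = rwr(W, Y0, delta=0.7)
protein_scores = Y[:len(ppi)]              # rank non-seed proteins by these values
```

In this toy example the non-seed protein that is linked to a lncRNA similar to the query receives a higher stable probability than the protein with no lncRNA link, which is the behavior the method relies on when ranking candidate proteins.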

PLPIHS: Prediction of LncRNA-Protein Interactions Using HeteSim Scores Based on Heterogeneous Networks
Predicting the association between biological molecules based on biological networks has been widely used in many types of research, such as searching for disease-related genes [27] and predicting drug-target interactions. Some of these approaches have achieved good prediction performance. Xiao et al. [35] proposed the PLPIHS method (Figure 6) to predict lncRNA-protein interactions using HeteSim scores, with a path-based metric used to calculate the relatedness between nodes in heterogeneous networks. Zeng et al. [105] inferred the association between nodes, whether of the same type or of different types, by means of a uniform and symmetric measure based on random paths. Because the relevance path captures the semantic information and also restricts the path of the walk, the score depends on the path along which the similarity is measured. A heterogeneous network is first constructed from three components: an lncRNA-lncRNA similarity network, calculated with Pearson's correlation coefficient between the expression profiles of each pair of lncRNAs; an lncRNA-protein association network, downloaded from GENCODE Release 24 [106]; and a PPI network obtained from the STRING v10.0 database [107]. HeteSim is then used to measure the degree of interaction of each lncRNA-protein pair in the network, expressed as a score. Finally, an SVM classifier is built on the scores of the different paths.
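The expression-based lncRNA-lncRNA similarity network used by both LPIHN and PLPIHS can be sketched as below. The expression profiles are made-up numbers, and the `cutoff` parameter is a hypothetical way of sparsifying the network; neither the reviewed papers' thresholds nor their handling of negative correlations is implied.

```python
import math

def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def similarity_network(profiles, cutoff=0.0):
    """Return {(lncRNA_a, lncRNA_b): r} for every pair whose r exceeds cutoff."""
    names = sorted(profiles)
    edges = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(profiles[a], profiles[b])
            if r > cutoff:
                edges[(a, b)] = r
    return edges

# Made-up expression profiles across four tissues.
profiles = {
    "lncA": [1.0, 2.0, 3.0, 4.0],
    "lncB": [2.0, 4.0, 6.0, 8.0],   # perfectly correlated with lncA
    "lncC": [4.0, 3.0, 2.0, 1.0],   # perfectly anti-correlated with lncA
}
edges = similarity_network(profiles, cutoff=0.5)
```

Only the lncA-lncB pair survives the cutoff here, because the other two pairs are negatively correlated.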
LOOCV is carried out to evaluate the performance of PLPIHS [108]. The results show that the AUC of PLPIHS on the network with a cutoff value of 0.3 is 0.968, which is higher than that of LPIHN. Similarly, PLPIHS outperforms the other methods on the 0.5 network and the 0.9 network. A total of 2000 lncRNA-protein associations from the positive samples of the 0.9 network and 2000 interactions from the remaining negative samples of the 0.3 network are randomly selected to construct an independent test set for further performance evaluation. On this independent test set, PLPIHS achieves an AUC value of 0.879.
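The AUC values quoted for these methods summarize how well predicted scores separate positive from negative pairs. A minimal sketch of the rank-based definition is given below: the AUC equals the fraction of positive-negative pairs in which the positive sample receives the higher score, with ties counted as one half. The scores and labels are made up for illustration.

```python
def auc(scores, labels):
    """Area under the ROC curve from the pairwise-ranking definition."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    # Count pairs where the positive outranks the negative; ties count 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up prediction scores for two positive and two negative pairs.
example_auc = auc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0])
```

The pairwise form is quadratic in the number of samples; for test sets of several thousand pairs a sort-based implementation would normally be preferred.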

Results of Comparisons of the Network-Based Models for Predicting LncRNA-Protein Interactions
To compare the network-based methods, their fusion of heterogeneous data and their performance evaluation were analyzed. All of the above-described methods used LOOCV to validate their respective performances. The test results of the network-based methods are shown in Table 5. The method proposed by Yang et al. [33] uses the same relevance path as the method that fuses multiple PPSNs. They extracted MALAT1 and AK0951949, each paired with all 99 proteins, as two experimental datasets. Known interactions between the two lncRNAs and their protein partners are considered as positive samples, while negative samples are lncRNA-protein pairs that have not been experimentally verified. From the ROC curves of the prediction results, the AUC is 0.955 for MALAT1 with all 99 proteins and 0.973 for AK0951949 with all 99 proteins.
LPBNI obtained data on 4870 lncRNA-protein interactions from NPInter 2.0. The method used the propagation matrix and the lncRNA-protein interaction network to set up the test samples. Each known lncRNA-protein interaction in the adjacency matrix is left out in turn as the test sample by setting the corresponding entry of the adjacency matrix to zero. Some nodes are deleted during this evaluation: nodes that do not have more than two connected nodes are considered to have no information propagating between them. Compared with random walk with restart, LPBNI showed the highest true positive rate at each false positive rate, with an AUC value of 0.878. PLPIHS selected data samples from networks with different cutoff values, obtaining 2000 positive samples from the 0.9 network and 2000 randomly selected negative samples from the 0.3 network. PLPIHS calculated the AUC at different network cutoff values (0.3 and 0.9); the AUC for the 0.3 network was 0.968, which was higher than the value obtained by LPIHN. To verify that PLPIHS performs better, the authors selected the same number of positive and negative samples from networks with different cutoff values and used this random selection to construct independent test datasets. Compared with the values generated by LPIHN, the AUC value of PLPIHS was 0.879. The accuracy, sensitivity, precision, Matthews correlation coefficient, and F1-score were also chosen as measurements to evaluate performance.
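The threshold-based measurements named above follow directly from the confusion matrix. The sketch below uses the standard definitions; the counts in the usage example are invented and do not correspond to any result reported for PLPIHS or the other methods.

```python
import math

def classification_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, precision, F1-score and MCC from confusion counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn) if tp + fn else 0.0      # also called recall
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * sensitivity / (precision + sensitivity)
          if precision + sensitivity else 0.0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "sensitivity": sensitivity,
            "precision": precision, "f1": f1, "mcc": mcc}

# Invented counts for a balanced test set of 100 positives and 100 negatives.
metrics = classification_metrics(tp=90, fp=20, tn=80, fn=10)
```

Unlike the AUC, these measurements depend on the score threshold used to call a pair positive, so they are best reported together with that threshold.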
The method that fuses multiple PPSNs to predict lncRNA-protein interactions approaches the problem from the perspective of fusing protein similarity networks. The best relevance path according to HeteSim was lncRNA-protein-protein. Fusing matrices is an effective way to obtain a more reliable and information-rich matrix or network. The best AUC value was 0.9068, obtained with Go+Pfam, Go+String, and Pfam+String. The AUC values of the 15 settings implemented in the paper by Zheng et al. [32] are shown in Table 5; these settings cover using only one similarity matrix, fusing two similarity matrices, fusing three similarity matrices, and fusing all four similarity matrices.
In the LPIHN model, the test datasets are determined in the same way as in the other interaction prediction methods, namely by leave-one-out cross-validation. This model used not only LOOCV but also precision versus recall curves and fold enrichment to measure performance, and the average fold enrichment over all test data was used for assessment. The LOOCV results showed that LPIHN obtained an AUC of 0.96. When attention was restricted to the top 4870 predicted lncRNA-protein interactions, 802 of the interactions predicted by LPIHN fell within this ranking.
To better understand the performance of network-based computational models for predicting lncRNA-protein interactions, we divided the heterogeneous networks into three cases according to the source of their components: (1) only the lncRNA-protein interaction network; (2) the network combining lncRNA-protein and protein-protein interactions; and (3) the network integrating lncRNA-protein, protein-protein and lncRNA-lncRNA interactions. For each case, the different methods were validated with the same set of test datasets, and their performances are compared by AUC in Figure 7. LPBNI (green) used leave-one-out cross-validation on a network of 4796 lncRNA-protein interactions. The method proposed by Yang et al. [33] and the method by Zheng et al. [32] (orange) used leave-one-out cross-validation on a network of 4467 lncRNA-protein interactions. The remaining two methods (blue) used leave-one-out cross-validation on a dataset in which 2000 lncRNA-protein interactions from the PLPIHS network with a cutoff of 0.9 were extracted as positive samples and 2000 negative samples were randomly selected from the 0.3 network. The gold set contains 185 lncRNA-protein interactions downloaded from the NPInter database. In Figure 7, different colors represent different network types, and bars of the same color represent verification results obtained on the same set of data. Figure 7 shows that a method performs better when its heterogeneous network is composed of more sources. When heterogeneous networks are constructed from the same sources, those constructed from weighted networks perform better. (The benefit of more data can be illustrated by lncRNA-lncRNA interactions, which can be considered from many perspectives: they can be calculated from expression profile data, sequence alignment or experiments.) For example, the method proposed by Yang

Discussion
Prediction of the interactions between lncRNAs and proteins is a very important step in research on lncRNAs. Based on the results of lncRNA-protein interactions, the functions as well as the associated diseases of lncRNAs can be inferred. The lncRNA-protein interaction is a very significant molecular mechanism. Computational approaches to predict lncRNA-protein interactions can be grouped into two broad categories. The first category is based on intrinsic features of the lncRNAs and proteins, including the sequence, structural information, and physicochemical properties. The second category is based on the fusion of heterogeneous data to construct a network.
Whereas the sequence-based methods only consider the properties of the RNA and neglect the internal relationship between the lncRNAs and proteins, the network-based methods pay more attention to this kind of internal relationship. The main advantage of a network-based computational model is that it can predict lncRNA-protein interactions from heterogeneous data. Predictions using the intrinsic features of sequences alone may lead to more false-positive interaction pairs than those obtained using a network-based method. The network-based computational model nevertheless has some disadvantages. Its predictions can be affected when only a limited number of interactions are known, and when there are no interaction data at all, it cannot be used to predict interactions.
New lncRNA-protein interactions are predicted more effectively by using several kinds of heterogeneous data sources. As the study of proteins becomes ever more comprehensive, effective computational models for predicting lncRNA-protein interactions from heterogeneous biological data can support more comprehensive annotation of lncRNAs.
Currently, there is very limited information on the interactions between lncRNAs and proteins, but computational methods can provide a large number of interaction pairs that can serve as valuable material for the inference of lncRNA functions. First, a great number of lncRNA-protein interactions can be provided by computational models based on intrinsic features. Second, since predictions using the intrinsic features of sequences alone may lead to some false-positive interaction pairs, computational models based on biological networks can be chosen to obtain more reliable predictions. In the future, a deep-learning-based framework can be considered to optimize the sequence-based and network-based computational models. For instance, long short-term memory models might be employed to build classifiers within a more advanced framework and achieve a more reliable classification model. We can also integrate machine learning with ab initio computing and network representation learning methods, and apply them to models that predict relationships between biological macromolecules. The interactions between lncRNAs and other molecules may enrich the functional annotations of lncRNAs. First, researchers can extract the characteristics of the molecules themselves with a machine learning algorithm, and then they can use an appropriate network representation learning algorithm to represent the relationships between nodes in heterogeneous networks as feature vectors. In this way, researchers can capture the internal features of molecules without ignoring the hidden topological information between molecules. This would overcome the weakness of most current research methods, which consider only ab initio prediction or only network-based methods.

Conflicts of Interest:
The authors declare no conflict of interest.