A Survey of Current Resources to Study lncRNA-protein Interactions.

Phenotypes are driven by regulated gene expression, which in turn is mediated by complex interactions between diverse biological molecules. Protein-DNA interactions such as histone and transcription factor binding are well studied, along with RNA-RNA interactions in short RNA silencing of genes. In contrast, lncRNA-protein interaction (LPI) mechanisms are comparatively unknown, likely because LPI are difficult to study. However, LPI are emerging as key interactions in epigenetic mechanisms, playing a role in development and disease. Their importance is further highlighted by their conservation across kingdoms. Hence, interest in LPI research is increasing. We therefore review the current state of the art in lncRNA-protein interactions. We specifically surveyed recent computational methods and databases which researchers can exploit for LPI investigation. We discovered that algorithm development is heavily reliant on a few generic databases containing curated LPI information. We show that early methods predict LPI using molecular docking, have limited scope and are slow, creating a data processing bottleneck. Recently, machine learning has become the strategy of choice in LPI prediction, likely due to the rapid growth in machine learning infrastructure and expertise. While many of these methods have notable limitations, machine learning is expected to be the basis of modern LPI prediction algorithms.


Introduction
Transcriptomics is the study of a complete set of RNA transcripts in a cell, measuring variable expression levels of the genome under different conditions. Modern transcriptomics is performed with high throughput sequencing to investigate the function of genes and biological pathways, commonly with bioinformatics methods applying differential gene expression analyses, splice site identification, transcript variant identification or determining alternative promoter usage for protein-coding transcripts [1]. However, these protein-coding transcripts only represent a small proportion of the transcriptome. A large proportion of the genome generates RNA transcripts which do not directly code for protein products [2]. These non-coding RNA (ncRNA) transcripts have been known to exist, but their properties make them difficult to characterize compared to coding transcripts. ncRNA can be divided into multiple categories based on function and length [3]. In this review, we specifically consider the long non-coding RNA (lncRNA) category of ncRNA and their interaction with proteins, an important functional mechanism of lncRNA.
LncRNA are very broadly defined as RNA transcripts exceeding 200 nucleotides (nt) in length without coding potential. Their length varies widely, ranging from hundreds to thousands of nucleotides [4]. LncRNA can act as gene regulators and, like other epigenetic mechanisms, are involved in numerous biological processes. They achieve their regulatory function through their ability to interact with a wide range of biological molecules, such as other nucleic acids and proteins [5], as well as with small molecules [6]. Among their more direct modes of action are sequestering and releasing transcripts to modulate gene expression, stabilizing transcripts, and binding to DNA to sterically hinder transcription initiation [7]. More indirectly, they can recruit proteins and other molecules to form a functional complex, or act as a scaffold for targeted chromatin formation [8].
An important layer of lncRNA-mediated gene regulation is LPI (lncRNA-protein interactions). We illustrate the importance of LPI in developmental and abiotic stress pathways with several examples encompassing multiple distinct species. In Drosophila melanogaster, regulatory networks mediated by LPI regulate key eye development [9] and dosage compensation pathways [10] mediated by RNA-binding proteins. In the plant Arabidopsis thaliana, LPI control alternative splicing within the nucleus by selectively displacing existing transcripts and subsequently altering root development [11,12]. Response to abiotic stress is also governed by LPI, as shown by a lncRNA recruiting histone methylases to suppress Arabidopsis thaliana flowering during cold conditions [13]. Danio rerio LPI are also observed to interface with transcription factors and other RNA-binding proteins during embryonic development, although their exact mechanism of action is not well known [14]. LPI also act as mediators of other epigenetic mechanisms, for instance as chromatin scaffolds to organize the three-dimensional structure of the genome in Mus musculus [15]. Due to the widespread involvement of LPI in epigenetics, dysregulation of certain LPI contributes to disease states, particularly cancers. Severity of a human pancreatic cancer phenotype is driven by a lncRNA-protein complex, which triggers a positive feedback loop of protein overexpression leading to poor patient outcomes [16]. Similarly, formation of a lncRNA-protein complex is associated with poorer prognosis in breast cancer [17], colon cancer [17] and lymphoma [18] by blocking phosphorylation sites, stabilizing other epigenetic factors, and through an unknown mechanism, respectively. Infectious diseases are also associated with LPI dysregulation, including COVID-19 [19,20]. A more exhaustive list of known LPI-disease associations is available at the LncTarD database [21].
Despite the wealth of information on LPI-disease associations, their precise mechanism of action remains unknown. Therefore, insight into LPI will be valuable in complex disease research, potentially resulting in improved diagnosis and treatment procedures.
Multiple high-throughput laboratory assays were developed to investigate LPI, some of which will be briefly discussed in this review article. However, exhaustively performing an experimental validation for each individual LPI is not practical given their volume and variety. Hence, computational methods are necessary to screen these high throughput assays for potential LPI which can then be subsequently experimentally validated, similar to transcriptomics workflows for conventional protein-coding RNA [22]. A variety of these computational LPI predictors exist, each applying different strategies to achieve their goals, and are dependent on a few biological databases containing subsets of experimentally validated LPI. In this review, we will discuss recent bioinformatics resources for studying LPI, with an emphasis on software and databases.

LPI laboratory assays
Because of the biological importance of LPI, many laboratory assays were developed to identify these interactions. Two general categories of such assays exist, protein-centric assays and RNA-centric assays, which can capture either the cellular environment of a living cell or extracted biological material [23]. Protein-centric assays target the protein component of an LPI, while RNA-centric assays target the lncRNA component. Each method varies in sensitivity and specificity, has different prerequisites and has unique advantages as well as disadvantages. Comprehensively comparing and contrasting these laboratory assays is outside the scope of this review; we provide only a high-level overview to give the computational methods discussed in this article some biological context. A more detailed overview of these assays can be found in a separate review article [23].
To discover proteins bound to RNA of interest (RNA-centric methods), IVT (in vitro transcribed) RNA can be tagged with biotin, and selectively bound to streptavidin for purification [24]. RaPID (RNA-protein interaction detection) [25] operates in a conceptually similar way to the previous method. IVT RNA can also be tagged with dyes and bound to protein microarrays, with fluorescence providing a quantitative output [26]. In vivo, cross-linking RNA with protein, either through formaldehyde or UV light, is used to identify LPI by purifying and extracting the RNA-bound proteins. CHART (capture hybridization analysis of RNA targets) [27], ChIRP (chromatin isolation by RNA purification) [28], MS2-BioTRAP (MS2 in vivo biotin-tagged RAP) [29], PAIR (peptide-nucleic-acid-assisted identification of RBPs) [30], RAP (RNA affinity purification) [31] and TRIP (tandem RNA isolation procedure) [32] all use either of these cross-linking strategies.
To discover RNA bound to proteins of interest (protein-centric methods), exploiting cross-linking is also common. The largest group of protein-centric methods are CLIP (cross-linking immunoprecipitation) based methods [33]. Many variants of CLIP methods exist [34], and when paired with high throughput sequencing are capable of generating libraries of data for further analysis. RIP-seq (RNA Immunoprecipitation) [35] and TRIBE (targets of RNA-binding proteins identified by editing) [36] also belong to this category of protein-centric methods.

LPI databases
Starbase, RNAInter, POSTAR, NPInter and RAIN all contain details of curated lncRNA-protein interactions, and many additional attributes (including functional annotation) associated with the interactions, derived from a combination of the laboratory assays discussed in the previous section [Table S1]. These are not limited exclusively to lncRNA, and contain various other interaction information, including interactions with other ncRNA, other nucleic acids and proteins [44,45,46]. Some contrasts between these databases are also observable from a species, usability and scope perspective, which will be discussed here. Starbase, POSTAR and RAIN contain LPI information from a small number (two to four) of species, while RNAInter and NPInter host a wide range of species. To improve usability, Starbase, RNAInter and RAIN feature third party tool integration to streamline bioinformatics workflows. In terms of scope, POSTAR and NPInter appear to be focused on disease phenotypes, providing disease association information, while Starbase, RNAInter and RAIN have a more generic focus.
ATtRACT and oRNAment databases contain details of RBP (RNA-binding protein) motifs. While not directly containing LPI, these can be applied to predict putative LPI and are a useful starting point or supplementary tool in screening for LPI.
All databases feature at least mouse and human datasets, likely due to their status as model organisms relevant to human disease, although some incorporate other model organisms as well. It is interesting to note that all databases feature advanced querying and search functions, likely reflecting the volume and complexity of LPI data. We have reviewed and compared them in Table S1. In summary, we discovered that there is a surprising lack of specialized LPI databases, with most databases also covering other combinations of nucleic acid and protein interactions.

LPI prediction algorithms
Most LPI prediction algorithms exploit these curated databases of prior LPI knowledge to tune their predictions. Computational strategies for LPI prediction can be divided into two high-level categories, molecular docking and machine learning. Lower-level subdivisions among the methods we surveyed are visualized in Figure 1, and include deep learning, tree-based methods, graph-based methods, similarity networks, image segmentation, matrix factorization and variants of the Fourier transform. Conventional molecular docking methods operate by finding the optimal configuration of a lncRNA-protein complex, and ranking the highest scoring configurations for further evaluation. Within the past decade, a large number of prediction algorithms based on machine learning have emerged. Most machine learning methods do not involve molecular docking simulations. Instead, they exploit known interactions between lncRNA and protein and/or biomolecular sequence information directly, although many also leverage known secondary structures to improve their performance [Tables 1,2]. As with the LPI databases, it is worth noting that none of these methods is tuned specifically for LPI prediction; they instead have the broader scope of identifying various combinations of nucleic acid-protein interaction.

Molecular docking approaches
Before the current ecosystem of machine learning algorithms was established, molecular docking was the dominant strategy used to predict and investigate LPI or RNA-protein interactions in general. By developing custom equations, which account for conformation and other steric properties, the likelihood of lncRNA-protein complex formation is scored. Low-level methodology does not vary significantly, with most methods applying a variant of the FFT (fast Fourier transform) to extract features from three-dimensional molecule representations, template-based modeling, or optimization toward a minimal energy state. Key factors considered include docking pose, distance and area of interfacial sites, energy-based criteria, and selection of the most structurally conserved docked complex [47]. Several methods also account for sequence homology or electrical charge between biological molecules [48]. Hierarchical clustering to group complexes of interest is not uncommon. However, at a high level these strategies are applied in different ways, and on different steric features. In many cases, a set of parameters must be specified by the user.
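The FFT step described above can be illustrated with a minimal sketch. It assumes toy binary occupancy grids, searches translations only (no rotations), and scores plain voxel overlap with a target site rather than the geometric, electrostatic or energy terms that real docking tools combine. The function name and the toy data are hypothetical and do not correspond to any of the tools reviewed here.

```python
import numpy as np

def fft_translation_scan(receptor, ligand):
    """Score every cyclic translation of `ligand` against `receptor`.

    Both arguments are 3D arrays of identical shape (toy occupancy grids).
    scores[t] equals sum_x receptor[x] * ligand[x - t], i.e. the overlap
    obtained after shifting the ligand grid by t voxels (with wrap-around).
    """
    f_receptor = np.fft.fftn(receptor)
    f_ligand = np.fft.fftn(ligand)
    # Correlation theorem: one inverse FFT yields the scores of all shifts.
    return np.real(np.fft.ifftn(f_receptor * np.conj(f_ligand)))

# Toy data (hypothetical): a 3x3x3 target site and a 3x3x3 ligand at the origin.
shape = (16, 16, 16)
receptor = np.zeros(shape)
receptor[8:11, 4:7, 10:13] = 1.0
ligand = np.zeros(shape)
ligand[0:3, 0:3, 0:3] = 1.0

scores = fft_translation_scan(receptor, ligand)
best_shift = tuple(int(i) for i in np.unravel_index(np.argmax(scores), scores.shape))
print(best_shift, round(float(scores[best_shift]), 3))
```

In this toy case the scan recovers the shift (8, 4, 10), at which all 27 ligand voxels overlap the target site. The point of the FFT formulation is that all translations are scored at once instead of being looped over explicitly; a full docking search would repeat such a scan over many sampled rotations.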
Most of the molecular docking methods we reviewed incorporate at least two of the previously discussed low-level methodologies [Table 1]. To provide some context for the building blocks of these more complex methods, we first present examples of methods that use an individual strategy, which include 3dRPC [49], HexServer [50], FireDOCK [51], HADDOCK [52] and PatchDOCK [53]. 3dRPC and HexServer are FFT-based methods. 3dRPC exploits the fact that LPI complexes have looser packing, and implements FFT on geometric complementarity and electrostatics with a custom scoring function. HexServer uses an FFT-based algorithm to exploit shape complementarity as a feature for optimization. Its key advantage is its reformulation of the conventional 3D search space to greatly boost the speed of the FFT, achieving results in seconds. Meanwhile, FireDOCK and HADDOCK optimize the minimum free energy of the lncRNA-protein complex. While FireDOCK focuses on exploiting side chain information, HADDOCK leverages ambiguous interaction restraints, and is one of the few methods which can generalize to multi-body problems as well as other biomolecular interactions. Among molecular docking tools, PatchDOCK takes a more unconventional strategy by summarizing low-level geometric features into higher-level features, and has some conceptual similarities to image segmentation. It is interesting to note that FireDOCK and PatchDOCK complement each other, as PatchDOCK can feed its output directly into FireDOCK.
Methods implementing a mixture of these strategies include HDOCK [54], MPRDOCK [55], P3DOCK [56] and NPDOCK [57]. HDOCK integrates template-based modeling as well as ab initio free docking, with a scope that extends to both proteins and nucleic acids. In addition, the user may specify binding sites of interest directly. MPRDOCK exploits protein flexibility by applying FFT and considering sequence homology of the target of interest to generate a repertoire of structures for "ensemble docking". We note that in this specific context of MPRDOCK, "ensemble docking" refers to the library of proteins generated by MPRDOCK, and is distinct from "ensemble learning" in the machine learning section [65,66,67], where the outputs of multiple algorithms are aggregated to obtain a result. P3DOCK (http://www.rnabinding.com/P3DOCK/P3DOCK.html) integrates the previously discussed 3dRPC as well as PRIME, which leverages sequence as well as structural homology in addition to the features used by 3dRPC. P3DOCK's authors claim that by complementing free docking and template-based docking strategies in a hybrid approach, a more accurate classification is possible. Finally, NPDOCK does not use a hybrid or ensemble strategy, but chains multiple methods into a pipeline of tools, which implement mostly FFT-based methods.
With the exception of one or two methods such as HexServer (http://hexserver.loria.fr/) [50], many of these algorithms are computationally expensive and time-consuming (hours to days of real time) to run. Some methods like HexServer require advanced hardware such as GPUs and specialized software engineering tools. Biological molecules are complex and dynamic, with their wide range of possible conformations as well as orientations greatly increasing the search space for algorithms. The molecular docking community is mindful of this, and provides their software on publicly accessible and user-friendly web servers for users to run these programs remotely, although time remains a bottleneck for these workflows.

Machine learning approaches
Most modern lncRNA-protein interaction (LPI) prediction algorithms use machine learning, where large datasets with attributes of interest are passed to an algorithm [Table 2]. The algorithm then "learns" from the data, discovering patterns in the data with minimal human intervention such as user-defined equations. In the case of LPI, known LPI and their corresponding sequences as well as structures are used for training the prediction models. Their strategies can be divided into several broad categories, including graph methods, ensemble learning, matrix factorization and deep learning. Of these strategies, matrix factorization appears to be the most popular and is integrated into many other higher-level strategies. LPI are commonly formulated as similarity matrices, which can then be easily formulated as a matrix factorization problem. Broader strategies incorporating matrix factorization, such as ensemble learning and methods which leverage multimodal data appear to have consistently robust performance. Few deep learning models exist, but they both perform and generalize well in comparison to other methods, and are likely to become more popular as they have become in other areas of biology.
Matrix factorization is the most common way to formulate LPI for prediction algorithms, including LPI-FKLKRR (LncRNA-Protein Interaction Kernel Ridge Regression, based on Fast Kernel Learning) [58], LPI-KTASLP (Prediction of LncRNA-Protein Interaction by Semi-Supervised Link Learning With Multivariate Information) [59], LPI-NRLMF (lncRNA-protein interaction prediction by neighborhood regularized logistic matrix factorization) [60], LPI-IBNRA (Long non-coding RNA-Protein Interaction Prediction based on Improved Bipartite Network Recommender Algorithm) [61] and LPI-BNPRA (Long non-coding RNA-Protein Interaction bipartite network projection recommended algorithm) [62]. These methods share a common theme of formulating lncRNA-protein interactions as a matrix factorization problem and using them in broader strategies such as multiple kernel learning or recommender algorithms. Known structural features are often used together with sequence features. In the special case of LPI-FKLKRR, matrices are reformulated into kernels for direct optimization with kernel ridge regression, increasing performance in the common scenario of class imbalance.
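To make the matrix factorization formulation concrete, the following is a minimal sketch of plain logistic matrix factorization on a toy lncRNA-by-protein interaction matrix. It is an illustration under stated assumptions and not the implementation of any method listed above: it omits neighborhood regularization, kernels and similarity information, and the function names, hyperparameters and toy matrix are all hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_mf(Y, observed, rank=1, lr=0.1, reg=0.01, epochs=2000, seed=0):
    """Fit P(interaction) = sigmoid(U @ V.T) on the observed entries of Y."""
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.standard_normal((Y.shape[0], rank))  # lncRNA factors
    V = 0.1 * rng.standard_normal((Y.shape[1], rank))  # protein factors
    for _ in range(epochs):
        # Gradient of the log-loss w.r.t. the logits, masked to observed pairs.
        G = observed * (sigmoid(U @ V.T) - Y)
        grad_U = G @ V + reg * U
        grad_V = G.T @ U + reg * V
        U -= lr * grad_U
        V -= lr * grad_V
    return sigmoid(U @ V.T)

# Toy matrix (hypothetical): rows are lncRNA, columns are proteins, 1 = known LPI.
Y = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 1, 1]], dtype=float)
observed = np.ones_like(Y)
observed[0, 1] = 0.0  # hide one known interaction from training

P = logistic_mf(Y, observed)
print(np.round(P, 2))
```

Because the hidden pair (0, 1) shares its row and column pattern with observed interactions, the low-rank model assigns it a high probability, which is the sense in which such methods "discover" candidate LPI. A rank of 1 suffices for this block-structured toy example; real datasets would need a larger rank and held-out validation.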
Some graph-based methods for LPI prediction are PBLPI (path-based lncRNA-protein interaction) [63] and PLPIHS (Predicting lncRNA-Protein Interactions using HeteSim Scores) [64]. PBLPI takes into account both functional and semantic similarity between proteins, while PLPIHS uses a custom distance metric to unify co-expression, lncRNA-protein interactions and protein-protein interaction scores to construct a network which is then provided to an SVM classifier. Performance is improved by preserving information regarding the biological network, taking into account lncRNA-protein interactions similar to the target.
Examples of hybrid and ensemble learning approaches are IRWNRLPI (Integrating Random Walk and Neighborhood Regularized Logistic Matrix Factorization for lncRNA-Protein Interaction Prediction) [65], SFPEL-LPI (sequence-based feature projection ensemble learning method) [66], HLPI-Ensemble (human lncRNA-protein interactions ensemble) [67], GPLPI (graph predict lncRNA-protein interaction) [68] and LPI-BLS (predicting lncRNA-protein interactions with a broad learning system-based stacked ensemble classifier) [69]. IRWNRLPI uses lncRNA-protein interactions and lncRNA/protein sequence similarity as input into a hybrid approach of random walk and neighborhood regularized logistic matrix factorization. Being an integrative model, it appears to be robust, although its accuracy varies on different biological systems. Ensemble approaches PMKDN, SFPEL-LPI, HLPI-Ensemble and LPI-BLS are all robust against noise due to their ensemble strategy incorporating multiple approaches, and are capable of discovering new LPI. LPI-BLS in particular stands out for its unconventional flat network architecture and aggregation strategy. However, we note that HLPI-Ensemble is specifically intended for human LPI only. GPLPI uses both sequence features and known secondary structures to train a graph-based neural network. In addition, by using an ensemble of features including evolutionary information, GPLPI's effectiveness was increased. An important distinction between these two methods is that GPLPI is trained on known plant lncRNA, and plant non-coding RNA have different properties (some ncRNA lose function even with 1-2 nucleotide changes) to that of animal non-coding RNA [70]. For this model to be effective on non-plant organisms, retraining is likely necessary but viable due to the relatively higher volume of data associated with animals, in particular humans [67].
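The random walk component mentioned above can be sketched generically as a random walk with restart over a lncRNA similarity network. This is a textbook formulation offered as an assumption about the general idea, not the procedure of IRWNRLPI or any other reviewed method; the similarity values, the restart probability and the function name are hypothetical.

```python
import numpy as np

def random_walk_with_restart(similarity, seeds, restart=0.3, tol=1e-10, max_iter=1000):
    """Propagate seed evidence over a similarity network until convergence."""
    # Column-normalize so each column is a transition probability distribution.
    W = similarity / similarity.sum(axis=0, keepdims=True)
    p0 = seeds / seeds.sum()
    p = p0.copy()
    for _ in range(max_iter):
        p_next = (1 - restart) * W @ p + restart * p0
        if np.abs(p_next - p).sum() < tol:
            return p_next
        p = p_next
    return p

# Toy lncRNA-lncRNA similarity matrix (hypothetical): 0 and 1 are similar, as are 2 and 3.
S = np.array([[1.0, 0.8, 0.1, 0.1],
              [0.8, 1.0, 0.1, 0.1],
              [0.1, 0.1, 1.0, 0.7],
              [0.1, 0.1, 0.7, 1.0]])
# Seed vector: only lncRNA 0 is known to bind the protein of interest.
seeds = np.array([1.0, 0.0, 0.0, 0.0])

rwr_scores = random_walk_with_restart(S, seeds)
print(np.round(rwr_scores, 3))
```

The stationary scores rank lncRNA 1 above lncRNA 2 and 3 because it is most similar to the seed, which mirrors the assumption behind similarity-driven predictors that similar lncRNA tend to share protein partners.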
Only a few deep learning approaches exist: DeepBind [70], LPI-CNNCP (lncRNA-protein interactions convolutional neural network copy-padding trick) [71] and DeepLPI (deep lncRNA-protein interactions) [72]. DeepBind was one of the first applications of deep learning to predict nucleic acid-protein binding, and is applicable to LPI. By reformulating the classical position weight matrix [73] as a convolutional kernel, it operates on raw sequence data to provide a simple prediction score for a nucleic acid-protein interaction [74]. LPI-CNNCP uses only lncRNA and protein sequence data recorded as k-mers as input into a CNN but achieves good results. It is also interesting to note that it appears to be one of the few models that are effective across different species. Meanwhile, DeepLPI feeds co-expression, sequence and structural data to a neural network optimized by a conditional random field. Using protein isoform data makes DeepLPI the only method to date with the ability to predict lncRNA interaction with different protein isoforms. Furthermore, its flexibility allows it to be extended to other biomolecular interactions such as miRNA.
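The idea of treating a position weight matrix as a convolutional kernel can be shown with a short sketch. It assumes a one-hot encoded RNA sequence and a single hand-set kernel for a hypothetical motif "GCAU"; in a trained model such as those described above, the kernel weights would be learned from data and followed by further layers, so this is an illustration of the operation only.

```python
import numpy as np

ALPHABET = "ACGU"

def one_hot(seq):
    """Encode an RNA sequence as a (length, 4) one-hot matrix."""
    x = np.zeros((len(seq), len(ALPHABET)))
    x[np.arange(len(seq)), [ALPHABET.index(base) for base in seq]] = 1.0
    return x

def scan_with_kernel(seq, kernel):
    """Slide a (motif_len, 4) kernel along the sequence, rectify, then max-pool."""
    x = one_hot(seq)
    m = kernel.shape[0]
    activations = np.array(
        [np.sum(x[i:i + m] * kernel) for i in range(len(seq) - m + 1)]
    )
    rectified = np.maximum(activations, 0.0)  # ReLU
    return float(rectified.max()), int(rectified.argmax())

# Hypothetical kernel: +1 for the motif base at each position, -1 otherwise.
motif_kernel = 2.0 * one_hot("GCAU") - 1.0

score, position = scan_with_kernel("AAGCAUCC", motif_kernel)
print(score, position)
```

Each window's activation is the sum of kernel weights selected by the bases in that window, exactly as in position weight matrix scanning. Here the only window with a positive activation is the exact motif occurrence starting at index 2, which receives the maximum score of 4.0.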
Other methods used to predict LPI that do not fall into a specific category include LPI-SKF (lncRNA-protein interaction similarity kernel fusion) [75], PMKDN (projection-based neighborhood non-negative matrix decomposition model) [76] and LPI-MiRNA [77]. LPI-SKF uses an integrative approach where verified lncRNA-protein interactions are used to build a network, and similarity kernel fusion is used to integrate protein and lncRNA similarity scores before applying manifold learning. PMKDN uses multiple features from lncRNA (nucleotide composition, expression levels) and protein (amino acid subcategories) to build a similarity matrix for similarity network fusion with a nearest-neighbor approach. Both these methods are robust against noise and capable of interaction discovery, but like most methods that express LPI as similarity matrices, they make a strong assumption that sequence homology correlates with interactivity, which may not hold in all cases. LPI-MiRNA takes a unique approach, exploiting miRNA as an intermediate unit of lncRNA-protein binding, and uses this in a network-based approach. While this gives LPI-MiRNA the ability to operate on datasets without prior knowledge of lncRNA interactions, it introduces a different limitation, namely a reliance on known miRNA-lncRNA and miRNA-protein interactions. An assumption is also made that miRNAs which interact with both a lncRNA and a protein would also form LPI, which may not always hold. Nevertheless, this method was shown to be effective.
lncPro [78] and catRAPID [79] are older methods but are featured in this manuscript because of their historical significance. lncPro was one of the first published machine learning LPI prediction algorithms, and many LPI algorithms resemble it. Higher-level features are extracted from lncRNA and protein sequence, which are then recorded as vectors as input into their model. Although the authors noted limitations associated with data availability and computational complexity at the time, this method became a template for many other machine learning methods, including those discussed in this manuscript. catRAPID does not apply machine learning, but instead constructs an interaction matrix from known secondary structure and other molecular features. A major limitation of this approach is its reliance on obsolete genomic data, which is expected to reduce prediction accuracy.
However, it is important to note that the scope of most LPI prediction algorithms is limited. Not all methods can predict interactions for novel lncRNA or proteins, and few methods generalize across species [62,69,71]. This is partly due to the limited availability of curated training data, with a small number of samples mostly from human or mouse present in a few databases [66,67,69]. LPI prediction for different protein isoforms is also not an active area of prediction algorithm development, with only one method having this functionality. Another limitation observed is that some methods exploit sequence similarity as an intermediate metric for LPI prediction, particularly methods which formulate LPI as similarity matrices. While this appears to be effective within the specific training datasets used by each study, the implicit assumption that sequence homology correlates with interactivity may not always hold, especially across different species [80,81]. At the same time, small nucleotide changes in biological molecules can cause major functional changes, which can potentially cause improperly trained prediction algorithms to produce misleading results [82].
We also note the limited accessibility of many of these machine learning methods. Among the methods reviewed that were published within the last five years, many do not make their source code publicly available and/or are written in proprietary programming languages such as MATLAB [83]. This restricts reproducibility and prevents usage of more than half of the methods we reviewed [Table 2]. At least partly because of the computational complexity involved, machine learning methods, which are well suited to resolving non-linear variables in high-dimensional data, have recently become a focus of the LPI field. However, computational methods that integrate the identification and functional annotation of LPI have not yet been developed or established, leaving a gap that has to be filled.
In contrast to published molecular docking algorithms, only a few methods provide active web servers for convenient use by the community, further raising the barrier for usability by biologists.
Table 2. A comparison of machine learning algorithms used to predict lncRNA-protein interactions. Important attributes of these machine learning algorithms, including their scope, strategies, training data, effectiveness and reproducibility are listed. More than half of these methods are not reproducible as their source code is proprietary or not available. A few methods provide web interfaces for users to enter their own data.

Conclusions
LPI form a unique layer of gene regulation across many species, and a growing interest in the field has resulted in the creation and expansion of curated databases as well as LPI prediction algorithms. Here, we reviewed some of the established (older than five years) and recent (within the last five years) LPI prediction approaches as well as databases. We note four important points. First, there has been a clear and recent shift from conventional molecular docking algorithms to machine learning methods, which attempt the direct prediction of LPI from biomolecular sequence identity and higher-level features. This shift to machine learning is observable across different fields of biology and is likely to continue with the rising availability of computational infrastructure and machine learning expertise. Second, these methods are heavily dependent on a set of curated data across several databases. Across these databases, a lack of universal standardization complicates data merging [84], preventing the community from unlocking the full potential of LPI data, in contrast to conventional transcriptomics databases such as SRA [85], EBI [86] and DDBJ [87]. This is in part due to the diversity of assays used to capture the LPI information, as well as the scope of the databases, which may subsequently bias the machine learning algorithms developed on these data. Third, there is a distinct lack of methods and databases which are specifically designed for LPI's unique properties, with most having a generic scope despite LPI's biological significance. Finally, it is concerning that more than half of the recent machine learning methods we surveyed are not reproducible or usable due to the absence of their source code. Nevertheless, LPI act as an important but less-studied regulatory layer, and understanding them will provide key context to deepen our understanding of biological systems.

Supplementary Materials:
The following are available online at www.mdpi.com/xxx/: Table S1: LncRNA-protein data repositories. Seven databases, four with LPI information and three with RNA motif information, are surveyed. Each database holds information on at least one combination of nucleic acid and protein interaction. The number of species each database contains varies widely, from 4 to 154. Every database contains at least human and mouse data, and has been updated within the past five years.