A Novel Probability Model for LncRNA–Disease Association Prediction Based on the Naïve Bayesian Classifier

An increasing number of studies have indicated that long non-coding RNAs (lncRNAs) play crucial roles in biological processes and in the diagnosis, prognosis, and treatment of complex diseases. However, experimentally validated associations between lncRNAs and diseases are still very limited. Recently, computational models have been developed to discover potential associations between lncRNAs and diseases by integrating multiple heterogeneous biological data; this has become a hot topic in biological research. In this article, we constructed a global tripartite network by integrating a variety of biological information including miRNA–disease, miRNA–lncRNA, and lncRNA–disease associations and interactions. Then, we constructed a global quadruple network by appending gene–lncRNA interaction, gene–disease association, and gene–miRNA interaction networks to the global tripartite network. Subsequently, based on these two global networks, we proposed a novel approach based on the naïve Bayesian classifier to predict potential lncRNA–disease associations (NBCLDA). Compared with state-of-the-art methods, our new method does not entirely rely on known lncRNA–disease associations and achieves reliable performance, measured by the area under the ROC curve (AUC), in leave-one-out cross validation. Moreover, in order to further estimate the performance of NBCLDA, case studies of colorectal cancer, prostate cancer, and glioma were carried out in this paper, and the simulation results demonstrated that NBCLDA can be an excellent tool for biomedical research in the future.


Introduction
Long non-coding RNAs (lncRNAs), those with over 200 nucleotides in length [1][2][3], are considered a new class of non-protein-coding transcripts. Much research evidence has shown that lncRNAs participate in almost the entire cell life cycle through various mechanisms and play significant roles in multiple biological processes including transcription, translation, epigenetic regulation, splicing, differentiation, immune response, cell cycle control, and so on [4][5][6][7][8]. In particular, the mutations and dysregulations of lncRNAs have been proven to be closely related to various human complex diseases [9][10][11], including AIDS [12], diabetes [13], Alzheimer's Disease (AD) [14], and many types of cancers such as breast [15], prostate [16], hepatocellular [17], and bladder cancer [18]. For instance, the expression of the lncRNA called HOTAIR was shown to be higher in primary breast tumors and metastases, and the HOTAIR expression level was proven to be a powerful predictor of eventual metastasis and death [19,20]. Additionally, the lncRNA MALAT1 was demonstrated as a prognostic indicator as well as a therapeutic target and acts as a potential therapeutic method for preventing lung cancer metastasis, which is targeted by antisense oligonucleotides (ASO) [21]. Moreover, recent studies have shown that the human H19 gene is frequently overexpressed in the myometrium and stroma during pathological endometrial proliferative events [22].
Obviously, predicting potential associations between lncRNAs and diseases would contribute to systematically understanding the pathogenesis of complex diseases at the molecular level and facilitate the identification of biomarkers for disease diagnosis, treatment, and prediction of response to therapy. However, relatively few experiments have supported lncRNA-disease associations until now. Hence, developing effective computational methods to uncover the potential associations between lncRNAs and diseases has become a hot topic in recent years. In general, existing models for predicting potential associations between lncRNAs and diseases can be divided into three categories. Among them, the first kind of methods are based on known disease-related lncRNAs. For example, Sun et al. proposed a model named RWRlncD [23], which carried out a random walk with the restart method on an lncRNA functional similarity network. This method uncovered potential associations between lncRNAs and diseases by integrating the disease similarity network, the lncRNAs functional network, and known lncRNA-disease associations. Ping et al. developed a method based on a newly constructed bipartite network, which relies on the known associations between lncRNAs and diseases [24]. Yang et al. constructed a coding-non-coding gene-disease bipartite network based on known associations between diseases and disease-causing genes (including lncRNAs). Then, they developed an iterative algorithm to uncover the possible links in the newly constructed bipartite network [25]. Ding et al. proposed a new model named TPGLDA to predict potential lncRNA-disease associations by integrating gene-disease associations with lncRNA-disease associations [26].
Different from the first kind of methods based on known lncRNA-disease associations, the second category of prediction models does not rely on known disease-related lncRNAs. For example, Chen et al. proposed a new method called HGLDA by integrating micro-RNA (miRNA)-disease associations and lncRNA-miRNA interactions. A hypergeometric distribution test is then applied to identify potential lncRNA-disease associations [27]. Liu et al. developed a computational framework by integrating human lncRNA expression profiles, gene expression profiles, and human disease-associated gene data to predict potential human lncRNA-disease associations [28]. Li et al. put forward a prediction method on account of the information of genome location to globally discover potential human lncRNAs related to vascular disease [29]. Gu et al. proposed a random walk-based model to identify potential associations between lncRNAs and diseases, which can be applied for predicting a disease without known associated lncRNAs and for inferring an lncRNA without known associated diseases [30].
In recent years, an increasing number of studies have been developed for understanding the cellular process, molecular interactions, and the pathogenesis of complex diseases at the molecular level by integrating different types of data and molecular interaction networks [31]. Such research includes the prediction of gene-disease associations [32], and the prediction of potential disease-associated miRNAs [33,34]. An increasing number of researchers have also adopted various data frameworks to increase the reliability of association prediction between diseases and lncRNAs. Hence, a third kind of prediction models has been proposed, in which multiple data sources are integrated to identify disease-related lncRNAs. For example, Lu et al. proposed a new prediction of lncRNA-disease associations via inductive matrix completion (named SIMCLDA), by integrating known lncRNA-disease interactions, disease-gene, gene-gene ontology associations [35]. Zhang et al. developed a novel model named LncRDNetFlow, which utilized a flow propagation algorithm to integrate a variety of information including the similarity of lncRNAs, the protein-protein interactions, and the similarity of diseases to infer lncRNA-disease associations [36]. Fu et al. proposed a model called MFLDA to predict potential lncRNA-disease associations by considering the quality and relevance of different heterogeneous data sources, which can select and integrate the data sources by assigning different weights to them [37]. Chen developed a path-based approach named KATZLDA for discovering potential lncRNA-disease associations by integrating information including known lncRNA-disease associations, lncRNA expression profiles, lncRNA functional similarity, disease semantic similarity, and Gaussian interaction profile kernel similarity [38]. All of these above data fusion-based methods can achieve effective results.
In this paper, to effectively predict potential lncRNA-disease associations, we first constructed a global tripartite network by integrating three kinds of heterogeneous networks: an lncRNA-disease association network, an miRNA-disease association network, and an miRNA-lncRNA interaction network. Then, considering that more heterogeneous networks can boost the prediction performance, we constructed a global quadruple network by appending a gene-lncRNA interaction network, a gene-disease association network, and a gene-miRNA interaction network to the tripartite network. Thereafter, based on these two newly constructed global networks, we propose a novel probabilistic model, the Naïve Bayesian Classifier for predicting potential LncRNA-Disease Associations (NBCLDA), to uncover potential lncRNA-disease associations. Moreover, in order to evaluate the prediction performance of the NBCLDA, the leave-one-out cross-validation (LOOCV) framework was implemented. The experimental results demonstrated the effective performance of the NBCLDA and showed that it achieves better predictive performance in LOOCV than several state-of-the-art methods.

Data Collection and Preprocessing
Considering that more heterogeneous data sources can boost the performance of prediction models, we combined seven heterogeneous data sets to construct our novel prediction model NBCLDA, whose ultimate goal is to infer potential associations between lncRNAs and diseases. These include the sets of miRNA-disease, miRNA-lncRNA, lncRNA-disease, gene-disease, and gene-lncRNA associations, as well as the set of gene-miRNA interactions and the set of diseases with disease tree numbers. The sets were collected from various databases.

Construction of miRNA-Disease and miRNA-lncRNA Association Sets
In this article, the miRNA-disease and miRNA-lncRNA association sets were downloaded from the HMDD [39] and the starBase v2.0 [40] databases in January 2015. Once these two data sets were collected, we removed any duplicate associations with conflicting evidence. Then, we further unified the names of miRNAs, and, thereafter, manually selected the common miRNAs in both sets. Finally, we retained only the associations related with those selected miRNAs in these two data sets. As a result, we obtained a data set DS 1 consisting of 4704 miRNA-disease interactions between 246 miRNAs and 373 diseases, and a data set DS 2 consisting of 9086 miRNA-lncRNA interactions between 246 miRNAs and 1089 lncRNAs (see Supplementary Materials Tables S1 and S2).
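The filtering steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `restrict_to_common_mirnas` and the name-normalization rule (trim whitespace, lower-case) are hypothetical, and the manual curation and the handling of conflicting evidence mentioned in the text are not modeled.

```python
def restrict_to_common_mirnas(mirna_disease, mirna_lncrna):
    """Deduplicate two association lists and keep only the shared miRNAs.

    mirna_disease: iterable of (miRNA, disease) pairs
    mirna_lncrna:  iterable of (miRNA, lncRNA) pairs
    """
    def normalize(name):
        # Hypothetical name unification: trim and lower-case the miRNA name.
        return name.strip().lower()

    # Using sets removes verbatim duplicate associations.
    md = {(normalize(m), d) for m, d in mirna_disease}
    ml = {(normalize(m), l) for m, l in mirna_lncrna}

    # miRNAs that occur in both data sets.
    common = {m for m, _ in md} & {m for m, _ in ml}

    ds1 = sorted(pair for pair in md if pair[0] in common)
    ds2 = sorted(pair for pair in ml if pair[0] in common)
    return ds1, ds2, common
```

Applied to the real HMDD and starBase exports, a procedure of this kind would yield the counterparts of DS 1 and DS 2; the counts reported above come from the authors' own curation, not from this sketch.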

Construction of the lncRNA-Disease Association Set
In this paper, the set of lncRNA-disease associations was collected from the MNDR v2.0 database [41] in 2017. In a similar way, once the data set was collected, we removed the duplicate associations with conflicting evidence. Then, we selected the lncRNA-disease associations with diseases belonging to DS 1 and lncRNAs belonging to DS 2 simultaneously. As a result, we obtained a data set DS 3 consisting of 407 lncRNA-disease associations between 77 lncRNAs and 95 diseases (see Supplementary Materials Table S3). The data set DS 3 is utilized as the test sample in our following simulation experiments.

Construction of the Gene-Disease and Gene-lncRNA Association Sets
In this article, the set of gene-disease associations was gathered from the DisGeNET v5.0 database [42] in May 2017, and the set of gene-lncRNA associations was downloaded from the LncACTdb v1.0 database [43]. Again, we removed the duplicate associations with conflicting evidence. Then, we further unified the names of genes, and thereafter manually selected the common genes in both sets. Finally, we retained only the associations related to those selected genes in these two data sets. Additionally, we transformed some disease names included in the newly constructed set of gene-disease associations into their aliases in DS 1, in order to keep the disease names uniform. For example, the disease names "pulmonary Emphysema" and "Bladder Neoplasm" in the newly collected set of gene-disease associations were converted into "pulmonary Embolism" and "Bladder Neoplasms" in DS 1, respectively. Hence, we obtained a data set DS 4 consisting of 3702 gene-disease associations between 171 genes and 227 diseases, and a data set DS 5 consisting of 411 gene-lncRNA interactions between 171 genes and 66 lncRNAs (see Supplementary Materials Tables S4 and S5).

Construction of the Gene-miRNA Association Set
In this paper, the set of gene-miRNA interactions was obtained from the miRecords database [44], which was last updated in April 2013. Once the data set was collected, we removed the duplicate associations with conflicting evidence. Then, we selected the gene-miRNA interactions whose genes belong to DS 4 or DS 5 and whose miRNAs belong to DS 1 or DS 2. As a result, we obtained a data set DS 6 consisting of 565 gene-miRNA associations between 109 genes and 174 miRNAs (see Supplementary Materials Table S6).

Construction of the Set of Diseases with Disease Tree Numbers
In this article, the set of diseases with disease tree numbers was gathered from the MeSH database [45]. In the MeSH database, the disease terms, described as directed acyclic graphs (DAGs), are classified and identified by disease tree numbers. We browsed the MeSH database and collected the disease tree numbers of the diseases in DS 1. As a result, we obtained a data set DS 7 consisting of 373 diseases with their disease tree numbers (see Supplementary Materials Table S7).

Analysis of Multi Relational Data Sources
In our model, four object types (lncRNAs, diseases, miRNAs, and genes) are considered. Based on these four object types, we collected six relational data sources from different databases. Figure 1 illustrates the relationships between these data sources. In Figure 1, $R_{\#_1\Omega-\#_2\Omega}$ denotes the associations between two of these object types, where $\#_1$ represents one object type, $\#_2$ represents the other, and $\Omega$ denotes the data set DS $\Omega$ to which the two objects belong. For example, $R_{m1-d1}$ denotes the associations between miRNAs and diseases, where m represents miRNAs, d represents diseases, and '1' indicates that all these miRNAs and diseases belong to the data set DS 1. In addition, Figure 1 shows the numbers of objects of the same type in the different data sets and the relationships among them. For instance, the number of diseases is 373 in $R_{m1-d1}$, 95 (= 29 + 66) in $R_{l3-d3}$, and 227 (= 66 + 161) in $R_{g4-d4}$; both the 95 diseases in $R_{l3-d3}$ and the 227 diseases in $R_{g4-d4}$ are part of the 373 diseases in $R_{m1-d1}$, and the intersection of the disease sets in $R_{l3-d3}$ and $R_{g4-d4}$ contains 66 diseases.

Method
As illustrated in Figure 2, our newly proposed model NBCLDA for predicting potential associations between lncRNAs and diseases can be mainly divided into the following steps:
Step 1: As illustrated in Figure 2a, on the basis of data sets DS 1, DS 2, and DS 3, we can construct an miRNA-disease association network labeled MDN, an miRNA-lncRNA association network labeled MLN, and an lncRNA-disease association network labeled LDN.
Step 2: As illustrated in Figure 2b, by integrating the three association networks constructed in Step 1, we can easily obtain a global tripartite network GN 1 of lncRNA-miRNA-disease relationships.
Step 3: As illustrated in Figure 2c, in order to utilize multiple data sources to improve the prediction performance, on the basis of data sets DS 4 , DS 5 , and DS 6 obtained above, we can also construct a gene-disease association network labeled GDN, a gene-lncRNA association network labeled GLN, and a gene-miRNA association network labeled GMN.
Step 4: As illustrated in Figure 2d, by appending the three association networks constructed in Step 3 to GN 1 constructed in Step 2, we can easily obtain a global quadruple network GN 2 of lncRNA-miRNA-gene-disease relations.
Step 5: As illustrated in Figure 2e,f, after applying the naïve Bayesian classifier theory to GN 1 and GN 2 , we can obtain two kinds of prediction models: NBCLDA-GN 1 and NBCLDA-GN 2 .
Step 6: As illustrated in Figure 2g,h, in order to further improve the prediction performance of the NBCLDA, we implemented disease semantic similarity in NBCLDA-GN 1 and NBCLDA-GN 2 . Thus, we can obtain two new prediction models, NBCLDA-GN 1 -SD and NBCLDA-GN 2 -SD, to infer potential lncRNA-disease associations.
In the same way, we can further represent the miRNA-lncRNA interaction network, MLN, and the lncRNA-disease association network, LDN, as $MLN = (M, L, E_2)$ and $LDN = (L, D, E_3)$, where $E_2 = \{e_{m_k-l_i} \mid m_k \in M, l_i \in L\}$ denotes the set of known interactions between the miRNAs in M and the lncRNAs in L, and $E_3 = \{e_{l_i-d_j} \mid l_i \in L, d_j \in D\}$ represents the set of known associations between the lncRNAs in L and the diseases in D. Thus, the edge $e_{m_k-l_i} \in E_2 \Leftrightarrow m_k$ is associated with $l_i$, and the edge $e_{l_i-d_j} \in E_3 \Leftrightarrow l_i$ is associated with $d_j$. Finally, the global tripartite network, GN 1, is expressed as

$$GN_1 = (M \cup L \cup D,\; E_1 \cup E_2 \cup E_3),$$

where $E_1$ denotes the edge set of MDN, that is, the set of known miRNA-disease associations.

Figure 2. [...] (g,h) inference of potential lncRNA-disease associations by using disease semantic similarity. Here, in (e-h), the known lncRNA-disease associations are represented as solid edges, and the candidate lncRNA-disease associations are represented as dashed edges.

Construction of GDN, GLN, GMN, and GN 2
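Let $\bar{D}$ be the set of $\bar{r}$ diseases in DS 4, $\bar{L}$ be the set of $\bar{n}$ lncRNAs in DS 5, $G$ be the set of $p$ genes in DS 4 or DS 5, $\bar{G}$ be the set of $\bar{p}$ genes in DS 6, and $\bar{M}$ be the set of $\bar{t}$ miRNAs in DS 6. Additionally, from Sections 2.3 and 2.4, it is clear that $\bar{D} \subseteq D$, $\bar{L} \subseteq L$, and $\bar{G} \subseteq G$; hence, we can let $\bar{D} = \{d_1, d_2, ..., d_{\bar{r}}\}$, $\bar{L} = \{l_1, l_2, ..., l_{\bar{n}}\}$, $\bar{G} = \{g_1, g_2, ..., g_{\bar{p}}\}$, $G = \{g_1, g_2, ..., g_{\bar{p}}, g_{\bar{p}+1}, ..., g_p\}$, and $\bar{M} = \{m_1, m_2, ..., m_{\bar{t}}\}$. We can thus represent the gene-disease association network, GDN, as $GDN = (G, \bar{D}, E_4)$, where $E_4 = \{e_{g_f-d_j} \mid g_f \in G, d_j \in \bar{D}\}$ denotes the set of known associations between the genes in $G$ and the diseases in $\bar{D}$. That is, the edge $e_{g_f-d_j} \in E_4 \Leftrightarrow g_f$ is associated with $d_j$.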
In the same way, we can further represent the gene-lncRNA interaction network, GLN, and the gene-miRNA interaction network, GMN, as $GLN = (G, L, E_5)$ and $GMN = (G, M, E_6)$, where $E_5$ and $E_6$ denote the set of known gene-lncRNA interactions and the set of known gene-miRNA interactions, respectively. In other words, the edge $e_{g_f-l_i} \in E_5 \Leftrightarrow g_f$ is associated with $l_i$, and the edge $e_{g_f-m_k} \in E_6 \Leftrightarrow g_f$ is associated with $m_k$.
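Finally, it is evident that the global quadruple network GN 2 can be expressed as

$$GN_2 = (M \cup L \cup D \cup G,\; E_1 \cup E_2 \cup E_3 \cup E_4 \cup E_5 \cup E_6),$$

where $E_1$, $E_2$, and $E_3$ are the edge sets of MDN, MLN, and LDN, respectively.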
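To make the network construction concrete, the following minimal sketch merges edge lists into an undirected adjacency map and looks up the common neighbors of an lncRNA-disease pair. The function names and the toy edge sets are hypothetical and chosen only for illustration; the real edge sets would come from DS 1 to DS 6.

```python
from collections import defaultdict


def build_network(*edge_sets):
    """Merge edge lists into one undirected adjacency map {node: neighbours}."""
    adjacency = defaultdict(set)
    for edges in edge_sets:
        for u, v in edges:
            adjacency[u].add(v)
            adjacency[v].add(u)
    return adjacency


def common_neighbors(adjacency, lncrna, disease):
    """Nodes directly connected to both the lncRNA and the disease."""
    return adjacency.get(lncrna, set()) & adjacency.get(disease, set())


# Toy edge sets with hypothetical node names.
E1 = [("m1", "d3"), ("m3", "d3")]                # miRNA-disease (MDN)
E2 = [("m1", "l2"), ("m3", "l2"), ("m2", "l1")]  # miRNA-lncRNA (MLN)
E3 = [("l1", "d3")]                              # lncRNA-disease (LDN)
E4 = [("g1", "d3"), ("g4", "d3")]                # gene-disease (GDN)
E5 = [("g1", "l2"), ("g4", "l2")]                # gene-lncRNA (GLN)
E6 = [("g4", "m3")]                              # gene-miRNA (GMN)

GN1 = build_network(E1, E2, E3)                  # tripartite network
GN2 = build_network(E1, E2, E3, E4, E5, E6)      # quadruple network
```

In this toy example the pair (l2, d3) shares only miRNAs in GN1 but shares both miRNAs and genes in GN2, which is the situation the two global networks are meant to capture.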

Construction of NBCLDA
The naïve Bayesian classifier is a simple probabilistic classifier with a naïve independence assumption that any feature of a class is independent of the other features of the class. Abstractly, based on the Bayesian classifier probability model $p(C \mid F_1, F_2, ..., F_n)$, where $C$ is a dependent class variable and $F_1, F_2, ..., F_n$ are the feature variables of class $C$, the posterior probability can be described as follows:

$$p(C \mid F_1, F_2, ..., F_n) = \frac{p(C)\, p(F_1, F_2, ..., F_n \mid C)}{p(F_1, F_2, ..., F_n)} \tag{1}$$

Furthermore, according to the above assumption, since each feature $F_i$ is conditionally independent of every other feature $F_j$ ($i \neq j$), Equation (1) can be expressed as:

$$p(C \mid F_1, F_2, ..., F_n) = \frac{p(C) \prod_{i=1}^{n} p(F_i \mid C)}{p(F_1, F_2, ..., F_n)} \tag{2}$$

Inspired by existing probabilistic models based on Bayesian theory to predict missing links in complex networks [46], we designed a prediction model NBCLDA to infer potential disease-related lncRNAs; we applied the naïve Bayesian theory to GN 1 and GN 2, constructed in Sections 3.1 and 3.2, respectively. In the context of Equation (1), in NBCLDA, the associations between lncRNAs and diseases in GN 1 and GN 2 are considered as the class variables, while the common neighboring nodes of every lncRNA-disease pair in GN 1 and GN 2 are considered as the feature variables. In particular, when applying the naïve Bayesian theory to GN 1, for any given pair of lncRNA and disease nodes in GN 1, we consider their common neighboring miRNA nodes to be conditionally independent of each other: since all of the miRNAs are different, we assume that none of them affects the others. To illustrate this assumption more intuitively, we provide an example in Figure 3a, in which the common neighboring nodes $m_1$ and $m_3$ between $l_2$ and $d_3$ are assumed to be conditionally independent.
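As a small worked illustration of the naïve Bayesian posterior described above, the sketch below multiplies a class prior by per-feature likelihoods and normalizes over the classes. The function name and all numbers are hypothetical; they are not taken from the data sets used in this paper.

```python
from math import prod


def naive_bayes_posterior(priors, likelihoods, observed):
    """Posterior p(C | F_1..F_n) under the naive independence assumption.

    priors:      {class: p(C)}
    likelihoods: {class: {feature: p(F | C)}}
    observed:    iterable of observed feature names
    """
    # Numerator: p(C) * prod_i p(F_i | C) for every class.
    joint = {
        c: priors[c] * prod(likelihoods[c][f] for f in observed)
        for c in priors
    }
    # Denominator p(F_1..F_n): sum of the numerators over all classes.
    evidence = sum(joint.values())
    return {c: joint[c] / evidence for c in joint}


# Hypothetical two-class example: 1 = "associated", 0 = "not associated".
posterior = naive_bayes_posterior(
    priors={1: 0.5, 0: 0.5},
    likelihoods={1: {"a": 0.8, "b": 0.5}, 0: {"a": 0.2, "b": 0.5}},
    observed=["a", "b"],
)
```

Feature "b" is equally likely under both classes and therefore does not move the posterior; only feature "a" does.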
However, when applying the naïve Bayesian theory to GN 2, there are two types of common neighboring nodes, miRNAs and genes, between a pair of lncRNA and disease nodes. In this case, it is unreasonable to consider that all of these common neighbors are conditionally independent of each other, since there may exist interactions between genes and miRNAs. Therefore, for any given pair of lncRNA and disease nodes in GN 2, let φ be the set that consists of all their common neighboring nodes. Then, for any miRNA node $m^*$, if there is a gene node $g^*$ that is associated with $m^*$, we consider the miRNA $m^*$ and its related gene $g^*$ as a whole, denote them as $m^*$-$g^*$, and call this an miRNA-gene pair. By this means, there are three kinds of features in φ: miRNAs, genes, and miRNA-gene pairs. Hence, we assume that these three kinds of elements in φ are conditionally independent of each other. To illustrate this assumption more intuitively, we present an example in Figure 3b, in which $m_1$, $m_3$, $g_1$, and $g_4$ are the common neighboring nodes between $l_2$ and $d_3$, and we assume that $m_3$-$g_4$, $m_1$, and $g_1$ are conditionally independent of each other.

Method for Applying the Naïve Bayesian Theory to GN 1
For any given lncRNA node $l_i$ and disease node $d_j$ in GN 1, let $N(l_i)$ and $N(d_j)$ be the sets of neighboring nodes that are directly connected to $l_i$ and $d_j$, respectively. From this, we construct $CN(l_i, d_j) = N(l_i) \cap N(d_j) = \{m_1, m_2, ..., m_h\}$, which denotes the set consisting of all common neighboring nodes between $l_i$ and $d_j$ in GN 1. Then, the prior probabilities for the existence of a relationship edge $e_{l_i-d_j}$ are calculated via:

$$p(e_{l_i-d_j}=1) = \frac{|M^c|}{|M|} \tag{3}$$

$$p(e_{l_i-d_j}=0) = 1 - \frac{|M^c|}{|M|} \tag{4}$$

where $|M^c|$ denotes the number of known associations between lncRNAs and diseases in LDN, and $|M| = n \times r$, where $n$ denotes the number of lncRNAs in L and $r$ denotes the number of diseases in D.
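Based on the naïve Bayesian classifier, the posterior probabilities for an edge $e_{l_i-d_j}$, representing whether the node $l_i$ is connected to $d_j$ in GN 1 or not, are defined as follows:

$$p(e_{l_i-d_j}=1 \mid CN(l_i,d_j)) = \frac{p(e_{l_i-d_j}=1)\, p(CN(l_i,d_j) \mid e_{l_i-d_j}=1)}{p(CN(l_i,d_j))} \tag{5}$$

$$p(e_{l_i-d_j}=0 \mid CN(l_i,d_j)) = \frac{p(e_{l_i-d_j}=0)\, p(CN(l_i,d_j) \mid e_{l_i-d_j}=0)}{p(CN(l_i,d_j))} \tag{6}$$

From Equations (5) and (6), we can directly identify whether an lncRNA node is connected with a disease node or not in GN 1. However, since it is often too complicated to calculate the value of $p(CN(l_i, d_j))$, we define the likelihood of a potential association existing between $l_i$ and $d_j$ in GN 1 as the ratio of the two posteriors, in which $p(CN(l_i, d_j))$ cancels:

$$S1(l_i,d_j) = \frac{p(e_{l_i-d_j}=1 \mid CN(l_i,d_j))}{p(e_{l_i-d_j}=0 \mid CN(l_i,d_j))} = \frac{p(e_{l_i-d_j}=1)}{p(e_{l_i-d_j}=0)} \prod_{m_\delta \in CN(l_i,d_j)} \frac{p(m_\delta \mid e_{l_i-d_j}=1)}{p(m_\delta \mid e_{l_i-d_j}=0)} \tag{7}$$

where $p(m_\delta \mid e_{l_i-d_j}=1)$ and $p(m_\delta \mid e_{l_i-d_j}=0)$ are the conditional probabilities that the node $m_\delta$ is a common neighboring node of $l_i$ and $d_j$ in GN 1, given that $l_i$ and $d_j$ are connected or not connected, respectively. Moreover, according to Bayesian theory, these two conditional probabilities can be expressed as:

$$p(m_\delta \mid e_{l_i-d_j}=1) = \frac{p(m_\delta)\, p(e_{l_i-d_j}=1 \mid m_\delta)}{p(e_{l_i-d_j}=1)} \tag{8}$$

$$p(m_\delta \mid e_{l_i-d_j}=0) = \frac{p(m_\delta)\, p(e_{l_i-d_j}=0 \mid m_\delta)}{p(e_{l_i-d_j}=0)} \tag{9}$$

where $p(e_{l_i-d_j}=1 \mid m_\delta)$ and $p(e_{l_i-d_j}=0 \mid m_\delta)$ represent the conditional probabilities that the lncRNA node $l_i$ is connected or not connected to the disease node $d_j$, respectively, given that $m_\delta$ is one of the common neighboring nodes between $l_i$ and $d_j$ in GN 1. These two probabilities are calculated via the following formulas:

$$p(e_{l_i-d_j}=1 \mid m_\delta) = \frac{N^{+}_{m_\delta}}{N^{+}_{m_\delta} + N^{-}_{m_\delta}} \tag{10}$$

$$p(e_{l_i-d_j}=0 \mid m_\delta) = \frac{N^{-}_{m_\delta}}{N^{+}_{m_\delta} + N^{-}_{m_\delta}} \tag{11}$$

where $N^{+}_{m_\delta}$ and $N^{-}_{m_\delta}$ denote the numbers of known and unknown associations, respectively, between lncRNAs and diseases whose common neighbors include $m_\delta$.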
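Hence, from Equations (8) and (9), Equation (7) can be modified as follows:

$$S1(l_i,d_j) = \frac{p(e_{l_i-d_j}=1)}{p(e_{l_i-d_j}=0)} \prod_{m_\delta \in CN(l_i,d_j)} \frac{p(e_{l_i-d_j}=0)\, p(e_{l_i-d_j}=1 \mid m_\delta)}{p(e_{l_i-d_j}=1)\, p(e_{l_i-d_j}=0 \mid m_\delta)} \tag{12}$$

Moreover, given any two nodes $l_i$ and $d_j$ in GN 1, the value of $p(e_{l_i-d_j}=0)$ is a constant, which we denote as $\varphi_m$ for convenience, so that $p(e_{l_i-d_j}=1) = 1 - \varphi_m$. Additionally, for each common neighboring node $m_\delta$ between $l_i$ and $d_j$ in GN 1, let $N_l$ denote the number of lncRNAs directly related to $m_\delta$, and $N_d$ denote the number of diseases directly related to $m_\delta$. Then, $N^{+}_{m_\delta} + N^{-}_{m_\delta} = N_l \times N_d$, and hence, Equation (7) can further be modified as follows:

$$S1(l_i,d_j) = \frac{1-\varphi_m}{\varphi_m} \prod_{m_\delta \in CN(l_i,d_j)} \frac{\varphi_m}{1-\varphi_m} \cdot \frac{N^{+}_{m_\delta}}{N_l \times N_d - N^{+}_{m_\delta}} \tag{13}$$

Considering that $N^{+}_{m_\delta}$ may equal zero, we introduce the Laplace calibration to guarantee that the value of $S1(l_i, d_j)$ will not be zero:

$$S1(l_i,d_j) = \frac{1-\varphi_m}{\varphi_m} \prod_{m_\delta \in CN(l_i,d_j)} \frac{\varphi_m}{1-\varphi_m} \cdot \frac{N^{+}_{m_\delta} + 1}{N_l \times N_d - N^{+}_{m_\delta} + 1} \tag{14}$$

Furthermore, by introducing the logarithmic function for standardization, for any given lncRNA node $l_i$ and disease node $d_j$ in GN 1, the probability of a potential association existing between them is finally defined by Equation (15), where λ is a constant utilized for normalization.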
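The scoring procedure for GN 1 can be sketched as follows. This is a hedged illustration, not the authors' implementation: the function name `s1_log_score` and the toy data are hypothetical, the Laplace calibration is assumed to add one to both the known and the unknown counts, and the score is returned as a plain natural logarithm because the exact normalized form involving λ is not reproduced here.

```python
from math import log


def s1_log_score(lnc, dis, mir_lnc, mir_dis, known, n_lnc, n_dis):
    """Log of the Laplace-calibrated naive Bayes ratio for one lncRNA-disease pair.

    mir_lnc: {miRNA: set of lncRNAs linked to it}
    mir_dis: {miRNA: set of diseases linked to it}
    known:   set of known (lncRNA, disease) associations (assumed non-empty)
    """
    p1 = len(known) / (n_lnc * n_dis)   # prior p(e = 1)
    p0 = 1.0 - p1                       # prior p(e = 0)
    score = log(p1 / p0)
    for m, lncs in mir_lnc.items():
        diseases = mir_dis.get(m, set())
        if lnc not in lncs or dis not in diseases:
            continue                    # m is not a common neighbour
        # Known / unknown lncRNA-disease pairs that share m as a neighbour.
        n_pos = sum((l, d) in known for l in lncs for d in diseases)
        n_neg = len(lncs) * len(diseases) - n_pos
        score += log(p0 / p1) + log((n_pos + 1) / (n_neg + 1))
    return score


# Hypothetical toy network: 2 lncRNAs, 2 diseases, 2 miRNAs.
mir_lnc = {"m1": {"l1", "l2"}, "m2": {"l2"}}
mir_dis = {"m1": {"d1", "d2"}, "m2": {"d1"}}
known = {("l1", "d1")}
score_a = s1_log_score("l2", "d1", mir_lnc, mir_dis, known, 2, 2)
score_b = s1_log_score("l2", "d2", mir_lnc, mir_dis, known, 2, 2)
```

In a leave-one-out setting, the held-out association would have to be removed from `known` before scoring, so that it does not contribute to its own counts.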

Method for Applying the Naïve Bayesian Theory to GN 2
In the same manner as described in Section 3.3.1, for any given lncRNA node $l_i$ and disease node $d_j$ in GN 2, we construct the set consisting of all common neighboring nodes, $CN(l_i, d_j) = \{m_1, m_2, ..., m_h, g_1, g_2, ..., g_u\}$. The posterior probabilities $p(e_{l_i-d_j}=1 \mid CN(l_i, d_j))$ and $p(e_{l_i-d_j}=0 \mid CN(l_i, d_j))$ then represent whether the node $l_i$ is connected to $d_j$ in GN 2 or not, respectively. Similarly to Section 3.3.1, the probability of a potential association existing between $l_i$ and $d_j$ in GN 2 is defined by Equation (16) (a detailed description of the scheme is given in the Supplementary Material), where $N^{+}_{m_{\bar{\alpha}},g_{\bar{\beta}}}$ and $N^{-}_{m_{\bar{\alpha}},g_{\bar{\beta}}}$ denote the numbers of known and unknown associations, respectively, between lncRNAs and diseases in GN 2 whose common neighboring nodes include both $m_{\bar{\alpha}}$ and $g_{\bar{\beta}}$, with $m_{\bar{\alpha}}$-$g_{\bar{\beta}}$ being an miRNA-gene pair. In addition, $N^{+}_{m_\alpha}$ and $N^{-}_{m_\alpha}$ denote the numbers of known and unknown associations, respectively, between lncRNAs and diseases in GN 2 whose common neighboring nodes include $m_\alpha$. Likewise, $N^{+}_{g_\beta}$ and $N^{-}_{g_\beta}$ represent the numbers of known and unknown associations, respectively, between lncRNAs and diseases in GN 2 whose common neighboring nodes include $g_\beta$. Finally, following the example of Equation (15), the probability of a potential association existing between $l_i$ and $d_j$ in GN 2 is defined analogously.
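The three kinds of features used for GN 2 (miRNA-gene pairs, single miRNAs, and single genes) can be separated with a short helper. The sketch below is illustrative only: the function name is hypothetical, and it assumes that every gene-miRNA interaction among the common neighbors forms its own pair, which the text does not specify for miRNAs that interact with several common-neighbor genes.

```python
def group_features(common_mirnas, common_genes, gene_mirna_edges):
    """Split common neighbours into miRNA-gene pairs, unpaired miRNAs, unpaired genes.

    common_mirnas:    set of miRNAs adjacent to both the lncRNA and the disease
    common_genes:     set of genes adjacent to both the lncRNA and the disease
    gene_mirna_edges: iterable of (gene, miRNA) interactions
    """
    # A pair is formed when both ends of a gene-miRNA edge are common neighbours.
    pairs = sorted(
        (m, g)
        for g, m in gene_mirna_edges
        if m in common_mirnas and g in common_genes
    )
    paired_mirnas = {m for m, _ in pairs}
    paired_genes = {g for _, g in pairs}
    single_mirnas = sorted(common_mirnas - paired_mirnas)
    single_genes = sorted(common_genes - paired_genes)
    return pairs, single_mirnas, single_genes


# The situation described for Figure 3b: m1, m3, g1, g4 are common
# neighbours of (l2, d3), and g4 interacts with m3.
pairs, single_mirnas, single_genes = group_features(
    {"m1", "m3"}, {"g1", "g4"}, [("g4", "m3")]
)
```

The result reproduces the grouping stated in the text: the pair m3-g4, the miRNA m1, and the gene g1 are treated as three conditionally independent features.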

Method of Appending the Disease Semantic Similarity into NBCLDA
The disease semantic similarity has been widely utilized as a valuable data source for discovering potential disease-related lncRNAs in many previous studies [30,38]. In this paper, we append the disease semantic similarity into our newly constructed prediction model NBCLDA to further uncover the potential relationships between lncRNAs and diseases.
From the description given in Section 2.5, we know that each disease term in the MeSH database can be described as a directed acyclic graph (DAG), in which the nodes represent the disease MeSH descriptors and all MeSH descriptors in the DAG are linked from more general terms (parent nodes) to more specific terms (child nodes) by a directed edge. Hence, in this paper, we first obtain the disease tree numbers according to the disease terms collected from the MeSH database. Thereafter, adopting the method proposed by Wang et al. [47], we suppose that disease $d_j$ is represented as the graph $DAG_{d_j} = (d_j, T_{d_j}, E_{d_j})$, where $T_{d_j}$ is the set of all ancestor nodes of $d_j$ including node $d_j$ itself, and $E_{d_j}$ is the set of corresponding links. The contribution of a disease $t$ in $DAG_{d_j}$ to the semantics of disease $d_j$ can then be calculated as follows:

$$D_{d_j}(t) = \begin{cases} 1 & \text{if } t = d_j \\ \max\{\Delta \cdot D_{d_j}(t') \mid t' \in \text{children of } t\} & \text{if } t \neq d_j \end{cases} \tag{18}$$

where Δ is the semantic contribution factor for the edges in $E_{d_j}$ that link a disease $t$ with its child disease $t'$; the disease $d_j$ is the most specific disease, and its own semantic score is defined as 1. Since nodes located farther from $d_j$ are more general diseases that contribute less to $d_j$, based on Equation (18), we can define the semantic value of the disease $d_j$ as follows:

$$DV(d_j) = \sum_{t \in T_{d_j}} D_{d_j}(t) \tag{19}$$

Therefore, based on the assumption that diseases sharing a larger part of their DAGs are more similar, the semantic similarity between diseases $d_j$ and $d_i$ can be defined as:

$$SD(d_i, d_j) = \frac{\sum_{t \in T_{d_i} \cap T_{d_j}} \left( D_{d_i}(t) + D_{d_j}(t) \right)}{DV(d_i) + DV(d_j)} \tag{20}$$

Finally, based on the disease semantic similarity and the similarities between lncRNAs and diseases, we can construct a new recommended measurement for inferring potential associations between lncRNAs and diseases from S and SD, where S denotes either $S1(l_i, d_j)$ or $S2(l_i, d_j)$, and SD, which is computed via Equation (20), denotes the disease semantic similarity.
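The DAG-based semantic similarity can be sketched as below, assuming the usual formulation attributed to Wang et al.: a disease contributes 1 to itself, each step toward a more general ancestor multiplies the contribution by Δ (taking the maximum over paths), and the similarity is the shared contribution divided by the sum of the two semantic values. The function names, the toy DAG, and the value Δ = 0.5 are assumptions for illustration; the text above does not state the value of Δ.

```python
def semantic_contributions(disease, parents, delta=0.5):
    """Return {t: contribution of t to `disease`} over the disease's DAG.

    parents: {node: iterable of parent (more general) nodes}
    """
    contrib = {disease: 1.0}
    frontier = [disease]
    while frontier:
        node = frontier.pop()
        for parent in parents.get(node, ()):
            value = delta * contrib[node]
            # Keep the largest contribution over all paths to this ancestor.
            if value > contrib.get(parent, 0.0):
                contrib[parent] = value
                frontier.append(parent)
    return contrib


def semantic_similarity(d_i, d_j, parents, delta=0.5):
    """Shared semantic contribution divided by the sum of the semantic values."""
    c_i = semantic_contributions(d_i, parents, delta)
    c_j = semantic_contributions(d_j, parents, delta)
    shared = c_i.keys() & c_j.keys()
    numerator = sum(c_i[t] + c_j[t] for t in shared)
    return numerator / (sum(c_i.values()) + sum(c_j.values()))


# Hypothetical toy DAG: diseases A and B share the single parent "root".
toy_parents = {"A": ["root"], "B": ["root"]}
sim_ab = semantic_similarity("A", "B", toy_parents)
sim_aa = semantic_similarity("A", "A", toy_parents)
```

In practice the parent map would be derived from the MeSH tree numbers in DS 7, for example by truncating each tree number at its last dot to obtain the parent.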

Performance Evaluation
The performance of the NBCLDA for inferring potential associations between lncRNAs and diseases is evaluated by implementing LOOCV based on experimentally verified lncRNA-disease associations. In each round, one known lncRNA-disease association is used as the test sample, whereas all the remaining associations are taken as training cases for model learning. This step is repeated until every known association has been used as the test sample once. Moreover, the area under the receiver operating characteristic (ROC) curve (AUC) is used to measure the overall performance of the method. The closer the AUC value is to 1, the better the performance is, and an AUC value of 0.5 corresponds to a random guess. We calculate a series of true positive rates (TPR, or sensitivity) and false positive rates (FPR, or 1−specificity) by setting different classification thresholds, and the ROC curve is plotted as TPR against FPR. Specifically, TPR corresponds to the ratio of the successfully predicted lncRNA-disease associations to the total experimentally verified lncRNA-disease associations, and FPR refers to the percentage of candidate (unverified) lncRNA-disease pairs ranked above the threshold; specificity, correspondingly, is the percentage of candidates ranked below the threshold.
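The ROC and AUC computation described above can be sketched in plain Python. This is a minimal illustration with hypothetical function names; it assumes that the scores of the test samples and candidates from all LOOCV rounds have already been pooled into one list, with label 1 for a held-out known association and 0 for a candidate, and that both labels occur at least once.

```python
def roc_points(scores, labels):
    """Sweep the threshold from high to low and return the (FPR, TPR) points."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    i = 0
    while i < len(ranked):
        j = i
        # Samples with tied scores cross the threshold together.
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            tp += ranked[j][1]
            fp += 1 - ranked[j][1]
            j += 1
        points.append((fp / n_neg, tp / n_pos))
        i = j
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2.0
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Hypothetical toy scores: two positives and two negatives.
toy_points = roc_points([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
toy_auc = auc(toy_points)
```

The trapezoidal area equals the fraction of positive-negative pairs in which the positive is ranked higher, which is why 0.5 corresponds to a random ranking.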
First, in order to estimate the influence of the addition of new types of nodes and the introduction of the disease semantic similarity on the predictions of potential associations between lncRNAs and diseases, we implemented the NBCLDA on the two constructed global networks GN 1 and GN 2 in the framework of LOOCV. The simulation results are shown in Figures 4 and 5. From Figure 4, the NBCLDA achieved an AUC of 0.8240 on GN 1 and an AUC of 0.8604 on GN 2 when the disease semantic similarity was not utilized. On the other hand, from Figure 5, an AUC of 0.8519 on GN 1 and an AUC of 0.8819 on GN 2 were achieved when the disease semantic similarity was included. This demonstrates that the prediction performance of our method not only benefits from the addition of the new types of nodes for predicting potential associations between lncRNAs and diseases, but also is significantly improved by the introduction of disease semantic similarity.
In order to further assess the performance of the NBCLDA, we compared it with other state-of-the-art models including HGLDA [27], SIMCLDA [35], MFLDA [37], Yang et al.'s method [25], KATZLDA [38], and TPGLDA [26] in the framework of LOOCV. For the comparison with HGLDA, a data set consisting of 183 experimentally validated lncRNA-disease associations had previously been constructed and used as the test set to evaluate HGLDA. Hence, for convenience, we compared our model, the NBCLDA, with the HGLDA on that data set using the framework of LOOCV. The simulation results are illustrated in Table 1 and Figure 6, from which it is evident that our approach outperformed the HGLDA. For the comparison with SIMCLDA, a data set consisting of 101 known lncRNA-disease associations between 30 lncRNAs and 79 diseases was selected from the data set of 293 experimentally validated lncRNA-disease associations used by SIMCLDA. These selected lncRNAs and diseases all belong to DS 3 in our paper. The simulation results are illustrated in Table 1, from which it is evident that our approach outperformed the SIMCLDA. For the comparison with MFLDA, the six relational data sources used by MFLDA (lncRNA-miRNA associations, lncRNA-gene function associations, lncRNA-disease associations, miRNA-gene interactions, miRNA-disease associations, and gene-disease associations) were collected to implement NBCLDA. The data set of experimentally validated lncRNA-disease associations was taken as the test set to evaluate its performance. The simulation results are illustrated in Table 1, from which it is evident that our approach outperformed the MFLDA. Furthermore, we compared the NBCLDA with Yang et al.'s method based on the data set DS 3 consisting of 407 lncRNA-disease associations between 77 lncRNAs and 95 diseases.
To make the comparison with Yang et al.'s method, following their description, we first deleted the nodes with a degree equal to 1. As a result, we obtained a data set consisting of 319 lncRNA-disease associations between 37 lncRNAs and 52 diseases. We then took this data set as the test set to compare the two methods in the framework of LOOCV. The simulation results are shown in Figure 7: the NBCLDA achieved an AUC of 0.9169 when implemented on GN 2, which is considerably higher than the AUC of 0.8568 achieved by Yang et al.'s method. We also compared the NBCLDA with KATZLDA, a path-based method designed to predict potential lncRNA-disease associations by integrating multiple pieces of information, including known lncRNA-disease associations, lncRNA expression profiles, lncRNA functional similarity, disease semantic similarity, and the Gaussian interaction profile kernel similarity. Because we could not obtain the expression profiles of the corresponding lncRNAs, we compared the two methods without this information. The simulation results are shown in Figure 8, which indicates that the NBCLDA achieves higher AUCs (0.8519 and 0.8829) than KATZLDA, with a corresponding AUC of 0.8323. This also supports the advantage of our newly constructed prediction model. Finally, for the comparison with TPGLDA, a data set consisting of 312 experimentally validated lncRNA-disease associations between 68 lncRNAs and 67 diseases and a data set consisting of 1941 gene-disease associations between 165 genes and 67 diseases were constructed. The data set of known lncRNA-disease associations was taken as the test set to evaluate performance. The simulation results are illustrated in Table 1, which shows that TPGLDA achieves a better performance, with an AUC of 0.92 compared with our AUC of 0.8982.
The main reason that TPGLDA achieves a better performance is probably that its consistence-based resource allocation algorithm takes into account the contribution of resources moved in both directions. However, the NBCLDA does not entirely rely on known lncRNA-disease associations and can integrate multiple data sources to predict potential associations. To further evaluate the performance of the NBCLDA, 20 percent of the known lncRNA-disease associations were randomly chosen as the training set, while the remaining known associations and all the unknown associations were taken as the testing set. We then compared the NBCLDA with the six methods on the predicted top-k associations using the F1-score, which is a measure of a test's accuracy [48]. Because the known lncRNA-disease associations are sparse, we set a different threshold k for each set of known associations when comparing with other methods; the comparison results are given in Table 2. From Table 2, we can see that the NBCLDA outperforms several other methods in terms of F1-score. TPGLDA again achieves higher values than our approach, likely for the reason given above. Nevertheless, our method's reduced reliance on known lncRNA-disease associations and its ability to integrate multiple data sources may make it a useful addition to biomedical research in the future.
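The top-k F1-score used in this comparison can be sketched as follows. This is a minimal illustration assuming the standard definition (precision is the share of the top-k predictions that are held-out known associations, recall is the share of held-out known associations recovered in the top k); the function name `top_k_f1`, the dictionary layout, and the toy pairs are hypothetical and are not taken from the compared methods.

```python
def top_k_f1(scores, test_positives, k):
    """F1-score of the k highest-scoring test pairs.

    scores         : dict mapping (lncRNA, disease) -> predicted score,
                     covering every pair in the testing set.
    test_positives : set of held-out known (lncRNA, disease) pairs.
    k              : number of top-ranked predictions to evaluate.
    """
    if k <= 0:
        return 0.0
    ranked = sorted(scores, key=scores.get, reverse=True)[:k]
    hits = sum(1 for pair in ranked if pair in test_positives)
    if hits == 0:
        return 0.0
    precision = hits / len(ranked)
    recall = hits / len(test_positives)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy scores, invented for illustration only.
    toy_scores = {
        ("lncA", "dis1"): 0.9,
        ("lncB", "dis1"): 0.8,
        ("lncC", "dis1"): 0.3,
        ("lncD", "dis1"): 0.1,
    }
    toy_positives = {("lncA", "dis1"), ("lncC", "dis1")}
    for k in (1, 2, 3):
        print(k, top_k_f1(toy_scores, toy_positives, k))
```

Because recall is bounded by k divided by the number of held-out associations, the achievable F1 depends strongly on k, which is consistent with choosing a different threshold k for each set of known associations.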

Case Studies
To further estimate the performance of the NBCLDA, case studies of three lncRNA-related diseases (colorectal cancer, prostate cancer, and glioma) are analyzed in this section. During the simulation experiment, the known lncRNA-disease associations in the data set DS 3 were used as the training samples, while experimentally validated lncRNA-disease associations outside DS 3 were used for testing. The top 20 disease-related lncRNAs predicted by the NBCLDA were checked against the relevant literature, and the corresponding evidence is listed in Table 3. In addition, the predicted results for the top 20 disease-related lncRNAs are presented in Supplementary Table S8.
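The case-study procedure described above (rank the candidate lncRNAs for one disease, exclude associations already used for training, and check the top 20 against independent evidence) can be sketched as follows. The function names `top_candidates` and `confirmed`, the data layout, and the toy identifiers are hypothetical; the scores would come from the trained model, and the evidence set would be compiled manually from the literature.

```python
def top_candidates(scores, disease, known_pairs, n=20):
    """Return the n highest-scoring lncRNAs for one disease,
    skipping pairs that were already in the training set."""
    candidates = [
        (lnc, score)
        for (lnc, dis), score in scores.items()
        if dis == disease and (lnc, dis) not in known_pairs
    ]
    candidates.sort(key=lambda item: item[1], reverse=True)
    return [lnc for lnc, _ in candidates[:n]]


def confirmed(ranked, literature_hits):
    """List (rank, lncRNA) for predictions with independent evidence."""
    return [
        (rank, lnc)
        for rank, lnc in enumerate(ranked, start=1)
        if lnc in literature_hits
    ]


if __name__ == "__main__":
    # Toy values, invented for illustration only.
    toy_scores = {
        ("lncA", "crc"): 0.9,
        ("lncB", "crc"): 0.7,
        ("lncC", "crc"): 0.8,
        ("lncA", "glioma"): 0.5,
    }
    training = {("lncA", "crc")}
    ranked = top_candidates(toy_scores, "crc", training, n=2)
    print(ranked)
    print(confirmed(ranked, literature_hits={"lncB"}))
```

Excluding the training pairs before ranking keeps the top-20 list restricted to genuinely new predictions, which is what makes the literature check informative.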
Colorectal cancer (CRC) is one of the most common cancer types in western countries, and its morbidity increases with age [49]. A growing number of studies have shown that lncRNAs play important roles in several steps of carcinogenesis and cancer metastasis and are involved in various cancers including CRC [50,51]. Therefore, we applied the NBCLDA to discover possible CRC-associated lncRNAs. As illustrated in Table 3, seven of the top 20 lncRNAs have been validated as related to colorectal cancer by recent biological literature, and five of them are ranked in the top 10 of the prioritized prediction results. The other two are the lncRNAs SNHG16 (ranked 12th) and TUG1 (ranked 18th). For example, Chen et al. indicated that the lncRNA XIST can regulate the process of CRC development by competing for miR-200b-3p, and thus it may be considered a biomarker for prognosis [52]. Additionally, it has been demonstrated that the lncRNA MALAT1 may be considered a potential prognostic and therapeutic target for colorectal cancer patients, as it can fulfill a chemoresistant function in colorectal cancer [53]. Nakano et al. found that the epigenetic destruction and loss of imprinting of the lncRNA KCNQ1OT1 play a significant role in the occurrence of colorectal cancer [54]. Han et al. suggested that H19 can be considered a candidate therapeutic biomarker and a new target for human CRC therapy when it is used as a growth regulator [55].
Prostate cancer is the second most common cause of cancer-related mortality in males worldwide [56]. A growing number of studies show that lncRNAs have become promising targets for the treatment of cancers including prostate cancer [57,58]. Hence, we applied the NBCLDA to uncover possible prostate cancer-associated lncRNAs; five of the top 20 predicted lncRNAs were verified against the relevant literature and are listed in Table 3. For example, Ren et al. evaluated the expression of MALAT1 in prostate cancer and showed that it may be considered a prospective therapeutic target for refractory prostate cancer [59]. Zhu et al. found that the lncRNA H19 and its derived miRNA H19-miR-675 were significantly downregulated in advanced prostate cancer and may be used for the diagnosis and treatment of advanced prostate cancer, because H19-miR-675 could act as a suppressor of prostate cancer metastasis [60]. Additionally, Tian et al. showed that targeting the lncRNA NEAT1 axis could potentially improve chemotherapy of prostate cancer [61].
Glioma is one of the most common malignant forms of brain tumors, and 6 out of 100,000 people may have gliomas [62]. Accumulating research has shown that lncRNAs play a significant role in the process of glioma development [63]. Therefore, we applied the NBCLDA to predict potential lncRNAs associated with glioma. Four of the top 20 glioma-related lncRNAs were validated by recent literature on biological experiments, and the results are illustrated in Table 3. For example, the lncRNA MALAT1 plays an important role in the progression and therapy of glioma, and it may be considered an effective prognostic biomarker for the treatment of glioma [64]. Zhang et al. demonstrated that the lncRNA H19 is overexpressed in glioma tissue and cell lines and promotes the proliferation of glioma cells [65]. Furthermore, Li et al. suggested that the lncRNA TUG1 can promote apoptosis of glioma cells and may act as a tumor suppressor in human glioma [66].

Discussion
Accumulating studies have indicated that lncRNAs play crucial roles in biological processes and in the diagnosis, prognosis, and treatment of complex diseases. Computational models that predict novel lncRNA-disease associations by integrating varied biological data have therefore attracted considerable attention, because they help to clarify disease mechanisms at the lncRNA level. In this paper, we construct a global tripartite network and a global quadruple network by integrating various biological information and propose a novel approach, the NBCLDA, which predicts potential lncRNA-disease associations by applying the naïve Bayesian classifier to the two constructed networks. Compared with current models, the NBCLDA does not entirely rely on known lncRNA-disease associations and can achieve a reliable performance with effective AUCs in the LOOCV framework. This means that our method can predict not only possible associations between lncRNAs and diseases that appear in the set of known associations, but also potential associations whose elements are not in the known data set.
To evaluate the predictive performance of our method, LOOCV was implemented based on the experimentally verified lncRNA-disease associations obtained from the MNDR database. The simulation results show that the NBCLDA performs strongly and that its predictive accuracy is improved by the addition of new types of nodes and by the disease semantic similarity. They also show that the NBCLDA achieves better AUCs than most of the compared state-of-the-art models in the framework of LOOCV. Moreover, to further estimate the performance of the NBCLDA, case studies of colorectal cancer, prostate cancer, and glioma were carried out in this paper. These simulation results demonstrated that the NBCLDA can be an excellent tool for future biomedical research.
Despite the reliable experimental results of the NBCLDA, our method also has some limitations. For example, the known experimentally validated lncRNA-disease associations are still limited, so the prediction performance of the NBCLDA would be improved by a more comprehensive data set. Furthermore, the data sources in this paper need to be strictly preprocessed according to the proposed method, which restricts the richness of the data sources to a certain extent.

Conclusions
In summary, the main contributions of this paper are as follows: (1) we constructed a global tripartite network by integrating a variety of biological information including miRNA-disease, miRNA-lncRNA, and lncRNA-disease associations and interactions; (2) we constructed a global quadruple network by appending gene-lncRNA interaction, gene-disease association, and gene-miRNA interaction networks to the global tripartite network; (3) we developed a novel approach, the NBCLDA, based on the naïve Bayesian classifier and applied it to the two global networks to predict potential lncRNA-disease associations; (4) we incorporated the disease semantic similarity into the NBCLDA to further uncover potential relationships between lncRNAs and diseases; (5) the NBCLDA can predict not only possible associations between lncRNAs and diseases that appear in the set of known associations, but also potential associations whose elements are not in the known data set; (6) the NBCLDA can integrate multiple heterogeneous biological data for discovering potential relationships between lncRNAs and diseases; (7) in future work, more biological data can be collected and preprocessed for use in the proposed method for predicting potential lncRNA-disease associations.
Supplementary Materials: The following are available at www.mdpi.com/xxx/s1. Supplementary Table S1: the known miRNA-disease associations of the data set DS 1, consisting of 4704 miRNA-disease interactions collected from the HMDD database; Supplementary Table S2: the known miRNA-lncRNA associations of the data set DS 2, consisting of 9086 miRNA-lncRNA interactions collected from the starBase v2.0 database; Supplementary Table S3: the known lncRNA-disease associations of the data set DS 3, consisting of 407 lncRNA-disease associations downloaded from the MNDR v2.0 database; Supplementary Table S4: the known gene-disease associations of the data set DS 4, consisting of 3702 gene-disease associations gathered from the DisGeNET v5.0 database; Supplementary Table S5: the known gene-lncRNA associations of the data set DS 5, consisting of 411 gene-lncRNA interactions downloaded from the LncACTdb database; Supplementary Table S6: the known gene-miRNA associations of the data set DS 6, consisting of 565 gene-miRNA associations obtained from the miRecords database; Supplementary Table S7: the disease tree numbers of the data set DS 7, consisting of 373 diseases with their disease tree numbers gathered from the MeSH database; Supplementary Table S8.

Acknowledgments: The authors thank the anonymous referees for suggestions that helped improve the paper substantially.

Conflicts of Interest:
The authors declare that there are no conflicts of interest regarding the publication of this paper.