A Probabilistic Matrix Factorization Method for Identifying lncRNA-Disease Associations

Recently, an increasing number of studies have indicated that long-non-coding RNAs (lncRNAs) can participate in various crucial biological processes and can also be used as the most promising biomarkers for the treatment of certain diseases such as coronary artery disease and various cancers. Due to costs and time complexity, the number of possible disease-related lncRNAs that can be verified by traditional biological experiments is very limited. Therefore, in recent years, it has been very popular to use computational models to predict potential disease-lncRNA associations. In this study, we constructed three kinds of association networks, namely the lncRNA-miRNA association network, the miRNA-disease association network, and the lncRNA-disease correlation network firstly. Then, through integrating these three newly constructed association networks, we constructed an lncRNA-disease weighted association network, which would be further updated by adopting the KNN algorithm based on the semantic similarity of diseases and the similarity of lncRNA functions. Thereafter, according to the updated lncRNA-disease weighted association network, a novel computational model called PMFILDA was proposed to infer potential lncRNA-disease associations based on the probability matrix decomposition. Finally, to evaluate the superiority of the new prediction model PMFILDA, we performed Leave One Out Cross-Validation (LOOCV) based on strongly validated data filtered from MNDR and the simulation results indicated that the performance of PMFILDA was better than some state-of-the-art methods. Moreover, case studies of breast cancer, lung cancer, and colorectal cancer were implemented to further estimate the performance of PMFILDA, and simulation results illustrated that PMFILDA could achieve satisfying prediction performance as well.


Introduction
Long non-coding RNAs (lncRNAs) are a class of important heterologous ncRNAs that differ in length from miRNAs by more than 200 nucleotides [1]. For a long time, lncRNAs have been considered to be transcriptional noise, and only recently have these views been changed by increasing evidence [2]. Related studies have shown that lncRNA plays an indispensable role in many biological processes, such as chromatin remodeling, gene transcription, protein transport and trafficking, and epigenetic regulation [3][4][5][6][7][8][9]. In addition, the dysregulation of lncRNA in coronary artery disease, autoimmune disease, neurological disorder, and various cancers suggests that lncRNA plays an important role in many complex diseases [10]. Recently, lncRNAs are increasingly attracting the attention of researchers in the field of bioinformatics [10][11][12][13].
With the rapid development of high-throughput sequencing technology, thousands of lncRNAs have been discovered in mammalian transcriptions. Numerous studies have also revealed the important role of lncRNA in biological processes and the significant effects in complex human diseases [1]. There is no doubt that lncRNAs are closely related to complex human diseases, and more importantly, some lncRNA-disease associations have been experimentally confirmed. For example, the expression of XIST is up-regulated in glioma tissues and GSCs. Functionally, XIST knockdown exerts tumor suppressor function by reducing cell proliferation, migration and invasion, and inducing apoptosis [14]. LncRNA HOTAIR is highly expressed in prostate cancer and is associated with the growth and aggressiveness of prostate cancer cells [15]. Hence, it is meaningful to identify as many potential lncRNA-disease associations as possible. However, up to now, due to the high costs of traditional biological experiments, the lncRNA-disease associations supported by biological experiments are still very limited. Therefore, it is highly desirable to develop effective computational models to predict potential lncRNA-disease associations. In recent years, some computational models have been developed already, and all these models can be approximately divided into three different categories such as the machine learning-based models, biological network-based models and the models without relying on known lncRNA-disease associations [16].
As for the machine learning-based models, Chen Xing et al. proposed a computational model called LRSLDA to predict potential lncRNA-diseases associations [17] through acquiring two different scores from lncRNA space and disease space simultaneously for the same lncRNA-disease pair. Huang et al. proposed a prediction model called ILNCSIM by combining the LRSLDA, lncRNA functional similarity and disease semantic similarity to calculate the probabilities of lncRNA-disease associations [18]. Zhao et al. developed a Bayesian classifier-based model to identify new cancer-associated lncRNAs by using known cancer-associated lncRNAs such as multivariate data, genome, regulatory protein and transcription data integration [19].
As for the biological network-based models, based on the assumption that lncRNAs with similar functions are often associated with phenotype-like diseases, Sun et al. proposed a model called RWRlncD based on the lncRNA-lncRNA function similarity network [20]. Through integrating known lncRNA expression profiles, lncRNA-disease associations, lncRNA functional similarity, disease semantic similarity, and Gaussian interaction profile kernel similarity, Chen et al. developed a prediction model called KATZLDA to discover potential lncRNA-disease associations [21].
Among these machine-learning-based models and biological network-based models mentioned above, one of their common features is that known lncRNA-diseases relationships are required during the implement of prediction. However, so far, due to the time complexity and high costs of traditional biological experiments, the experimentally identified known lncRNA-disease associations are still very limited. Hence, some computational models that do not rely on known lncRNA-disease associations have been proposed in recent years. For instance, Liu et al. proposed a model based on the intermediate node genes to predict the potential disease-related lncRNAs [22]. Chen et al. proposed a model called HGLDA based on integrating miRNA-disease associations and lncRNA-miRNA interactions to discover novel lncRNA-disease associations [23].
In this paper, unlike the most advanced prediction models described above, a new model based on probability matrix decomposition called PMFILDA is proposed to discover potential lncRNA-disease associations. At present, matrix decomposition has been widely used in the field of bioinformatics. For example, in the prediction of miRAN-disease correlation, Chen et al. proposed to predict miRNA-disease correlation based on induction matrix complementation, matrix decomposition and heterogeneous graphs [24,25]. Zhao et al. proposed a method based on symmetric non-negative matrix factorization and Kronecker regularized least squares to predict the correlation of miRNA-disease [26]. The difference between our PMFILDA method and above-mentioned models is that we first constructed three kinds of binary association networks based on experimentally validated lncRNA-miRNA associations, miRNA-disease associations, and lncRNA-disease associations separately. Then, based on these three newly constructed association networks, we constructed a weighted lncRNA-disease association network. Moreover, based on the semantic similarity of disease and the functional similarity of lncRNA, we further adopted the KNN algorithm [27] to update the weighted lncRNA-disease association network. Then, according to the updated weighted lncRNA-disease association network, we decomposed the weight matrix of lncRNA-disease into low-order characteristic matrices U and V of the lncRNAs and diseases based on the probability matrix factorization. Finally, the product of U and V would be used to predict the scores of lncRNA-disease pairs. The flowchart of our prediction model PMFIDLA is shown in the following Figure 1. In Subgraph A of above Figure 1, the lncRNA-miRNA association network is constructed based on known lncRNA-miRNA associations downloaded from starbase [28]. Nodes that are linked by solid lines indicate that they are associated. In Subgraph B, the miRNA-disease association network is constructed based on known miRNA-disease associations downloaded from HMDD [29]. Nodes that are linked by solid lines indicate that they are associated. In Subgraph C, the lncRNA-disease association network is constructed based on known lncRNA-disease associations downloaded from MNDR v2.0 [30]. Nodes that are linked by solid lines indicate that they are associated. In Subgraph D, the lncRNA-disease weight network is constructed based on Subgraph A, Subgraph B, and Subgraph C. Nodes that are linked by solid lines indicate that they are related. The blue dashed lines indicate the weight of the initial assignment between nodes. Subgraph E is a network of disease semantic similarity and the numbers in E are similarity scores. Subgraph F is a network of lncRAN functional similarity and the numerics in F are similarity scores. Subgraph G is a lncRNA-disease weighting network that has been updated and the red dashed line indicates weights having been redistributed between nodes. Subgraph H is the lncRNA-disease associations that are ultimately predicted by our method, and the solid red lines indicate the predicted lncRNA-disease associations with relatively high rankings. The KNN is a K-nearest neighbor algorithm used to find the most similar nodes. The PMF is a probability matrix factorization algorithm.

Materials
Since known lncRNA-disease associations were considered in our prediction model PMFIDLA, in this section, we download three kinds of gold standard datasets consisting of known lncRNA-miRNA associations, miRNA-disease associations, and lncRNA-disease associations from relevant authoritative databases, respectively.

Human LncRNA-MiRNA Associations and MiRNA-Disease Associations
Firstly, we downloaded the datasets of experimentally validated known miRNA-disease associations and lncRNA-miRNA associations from the two authoritative databases such as HMDD [29] and starbase [28] separately. Then, after having further unified the names of miRNAs in these two datasets, we could obtain 246 common miRNAs from both of these two datasets. For convenience, we denoted the set of these shared miRNAs as con_M. Thereafter, based on these 246 shared miRNAs in con_M, we finally downloaded 4704 different miRNA-disease associations and 9086 different lncRNA-miRNAs associations from above two authoritative databases. In addition, for convenience, we denoted the set of these 4704 different miRNA-disease associations as MD and the set of these 9086 different lncRNA-miRNAs associations as LM separately. Moreover, through statistics, there are 373 different diseases in MD and 1089 different lncRNAs in LM respectively (see Supplementary Materials Tables S1 and S2).

Human LncRNA-Disease Associations
Secondly, we downloaded the dataset of experimentally validated known lncRNA-disease associations from the MNDR v2.0 database [30], and for convenience, we denoted the dataset of these downloaded known lncRNA-disease associations as LD. Furthermore, to adapt the downloaded data to our prediction model, we would further process the original data as follows: Step 1: Obtaining the set of different lncRNAs shared in both LD and LM. In addition, for convenience, we denoted the set of these shared lncRNAs as con_L.
Step 2: Obtaining the set of different diseases existed in MD and LD. In addition, for convenience, we denoted the set of these shared diseases as con_D.
Step 3: Obtaining the set of lncRNA-disease associations with both lncRNAs in con_L and diseases in con_D based on the set of LD.
And as a result, we finally obtained 407 different lncRNA-disease associations including 77 different lncRNAs and 95 different diseases(see Supplementary Materials Tables S3).

Construction of the lncRNA-miRNA Association Network and miRNA-Disease Association Network
Based on the set of LM and MD, we can construct the lncRNA-miRNA association network and miRNA-disease association network according to the following steps respectively: Step 1: Supposing that there are n l different lncRNAs in LM, and for convenience, we denote the set of these lncRNAs as L = {l 1 , l 2 , . . . , l n l }.
Step 2: Supposing that there are n d different diseases in MD, and for convenience, we denote the set of these diseases as D = {d 1 , d 2 , . . . , d n d }.
Step 3: Supposing that there are n m different common miRNAs existed in both MD and LM, and for convenience, we denote the set of these miRNAs as con_M = {m 1 , m 2 , . . . , m n m }.
Step 4: Hence, we can firstly obtain an lncRNA-miRNA association network G LMN = (L, con_M, E lm ), where E lm denotes the set of experimentally verified known associations in LM. For ∀l i ∈ L, m j ∈ M, we define that there is an edge e l i −m j between l i and m j in E lm if and only if there is an experimentally verified known associations between l i and m j in LM.
Step 5: Simultaneously, we can also obtain an miRNA-disease association network G MDN = (con_M, D, E md ), where E md denotes the set of experimentally verified known associations in MD. For ∀m i ∈ con_M, d j ∈ D, we define that there is an edge e m i −d j between m i and d j in E md if and only if there is an experimentally verified known associations between m i and d j in MD.

Construction of the Weighted lncRNA-Disease Association Network
Based on the newly constructed association networks such as G LMN and G MDN , we can further obtain a weighted lncRNA-disease association network G LDW N = (L, D, E ld , W ld ), where E ld denotes the set of edges between different lncRNAs in L and diseases in D. For ∀l i ∈ L, d j ∈ D, we define that there is an edge e l i −d j between l i and d j in E ld if and only if there is at least one miRNA m k in con_M with experimentally verified known associations with both l i in LM and d j in MD simultaneously. In addition, then the weight w l i −d j corresponding to e l i −d j can be calculated according to the following steps: Step 1: Supposing that there are T different miRNAs in con_M with experimentally verified known associations with both l i in LM and d j in MD simultaneously, and for convenience, we denote the set of these T different miRNAs as CM = {m 1 , m 2 , . . . , m T }.  (1):

Disease Semantic Similarity Measure
Considering that the similarity between disease pairs can calculated by their directed acyclic graphs (DAGs) [31], while estimating the semantic similarity of diseases, for any given disease, we will firstly express it as its directed acyclic graph (DAG), and as illustrated in the following Figure 2, in its corresponding DAG, all annotated terms associated with this disease will be contained. For instance, in Figure 2 the DAGs of two different diseases such as Breast Neoplasms (d 1 ) and Liver Neoplasms (d 2 ) are shown, and it is obvious that the DAG of d 1 can be denoted as where T d 1 denotes all the ancestor nodes of "d 1 " and itself, and E d 1 represents the set of edges in DAG d 1 . Moreover, for any disease d in DAG d 1 , its semantic contribution to d 1 can be calculated according to the following Formula (2): where ∆ will be set to 0.5 based on the suggestion proposed by the state-of-the-art literature [31]. Moreover, in the same way, it is easy to see that the DAG of d 2 can be denoted as , and then, for any given diseases d 1 and d 2 , the semantic similarity between them can be measured according to the following Formula (3) obviously: Figure 2. The DAGs of the disease Breast Neoplasms and Liver Neoplasms. In addition, the disease term and its identification numbers are included in corresponding node. The common terms of the two diseases are illustrated by green nodes.

LncRNA Similarity Measure
The functional similarity between lncRNAs measures how similar their functions will be. In this section, based on the method proposed by the state-of-the-art literature [31], for any given lncRNAs l i and l j , Supposing that l i and l j have known associations with a group of diseases GD(l i ) = d i1 , d i2 , . . . , d ip and GD(l j ) = d j1 , d j2 , . . . , d jq in LD respectively, then the functional similarity between them can be measured according to the following Formula (4):

Weight Redistribution in G LDW N Based on the KNN Algorithm
Based on the above descriptions, it is easy to see that we can represent the network G LDW N with its weight matrix W ld , where W ld [i][j] = w l i −d j . Moreover, considering that known lncRNA-disease associations are very sparse, which may cause that there exist some lncRNAs with no associations with any diseases, or some diseases with no associations with any lncRNAs. Hence, some potential associations between predicted lncRNAs and diseases will be invalid. Therefore, in this paper, we will rebuild the weight matrix W ld to solve this kind of problem as follows: Step 1: Firstly, representing the ith row of the weight matrix W ld as W ld (l i , : and the jth column of the weight matrix W ld as W ld (: Step 2: Then, for any given lncRNA l q and any l i in L other than l q , based on above Formula (4), it is obvious that we can obtain the functional similarity FS(l i , l q ) easily, and moreover, after sorting these values of functional similarities between l q and all remaining lncRNAs other than l q in descending order, then we can obtain the corresponding lncRNAs from the first K elements in the sorted results. For convenience, let l 1 , l 2 , . . . , l K denote these K lncRNAs, then the qth row of W ld can be updated according to the following Formula (6): where α ∈ (0, 1] is a decay factor, which means that a higher decay will be assigned to l i if it is more dissimilar to l q , and NL = ∑ i∈[1,K] FS(l i , l q ) is the normalization factor, which is used for normalization of the value of W ld (l q , :). Additionally, in similar way, it is obvious that the pth column of W ld can also be updated according to the following Formula (7): where d 1 to d K denote the top K diseases most similar to d p , β ∈ (0, 1] is the decay factor, and

Standard Matrix Factorization
Up to now, the matrix decomposition technology is widely used in the field of recommended systems, since not only the computational complexity can be reduced by matrix decomposition, but also good performance can be achieved in solving the matrix scarcity problem. The standard matrix decomposition aims to find two low-ranking, latent feature matrices whose products are used to fit the original matrix. Hence, for the weight matrix W ld ∈ R n l ×n d constructed above, it is obvious that we can decompose W ld into two different matrices U ∈ R n l ×k and V ∈ R n d ×k (k min(n l , n d )), and there is W ld ≈ UV T . Thereafter, the problem of disease-related lncRNA prediction can be further expressed by the following Formulas (8) and (9): where the row vectors U i and is V j represent the ith lncRNA-specific and jth disease-specific latent feature vectors respectively. In addition, obviously, the above Formulas (8) and (9) form a convex optimization problem, which can be solved by some existing optimization algorithms such as the iterative update algorithm [32] easily.

Probabilistic Matrix Factorization
Since the probability matrix decomposition is based on the decomposition of the standard matrix, supposing that W ld is a positive distribution with Gaussian noise, then we can define the conditional distribution over the W l d as: (10) where N(W ld (i, j)|U i V T j , σ 2 ) is the probability distribution function of the normal distribution and Obviously, p(W ld |U, V, σ 2 ) is the likelihood function (i.e., the product of all the weights).
In addition, supposing that the matrices U and V satisfy the Gaussian prior to a mean of 0, then the priors of U and V can be denoted as follows: Here, the matrix I is a T × T dimensional unit diagonal matrix. Assuming that U and V are independent of each other, then the posterior distribution of U and V can be obtained by following Formula (13): Then the log of the posterior distribution over the features of lncRNAs and diseases can be calculated as follows: Here, C is a constant factor. In addition, as for N(U i |0, σ 2 U I), since there is: Hence, considering that the matrix I is an unit diagonal matrix, which means that there is (σ 2 U I) −1 = 1 σ 2 U I, then we have: Here, C U is a constant factor. Similar to the above analyses, we can also have: Therefore, it is obvious that maximizing the log-posterior on U and V with hyper-parameters being kept fixed in Formula (13) will be equivalent to minimizing the following objective function: where is the Hadamard product, and || · || F represents the Frobenius norm.

Optimization
Based on the properties of Frobenius norm, the Formula (19) can be rewritten as the form of the LaGrangian function as follows: Based on above Formula (20), we can further obtain its partial derivatives with respect to U and V as follows: Therefore, we can construct the update rules based on the gradient descent algorithm as follows: where λ m is the momentum parameter, which can accelerate the convergence speed of U and V, the parameter λ denotes the learning rate, and based on the suggestion proposed by the state-of-the-art literature [31] (Wang et al., 2010), λ m and λ will be set to 0.8 and 0.005 respectively [33]. Hence, based on above update rules illustrated in Formulas (23) and (24), we can update the lncRNA-specific and disease-specific latent feature matrix U and V until they become converged. Then, we can finally obtain the predicted lncRNA-disease association matrix W ld = FS × UV T × SD. In addition, as for any column d i in W ld , we can sort the elements (i.e., lncRNAs) in d i in descending order, then the top-ranked lncRNAs in d i can be predicted as d i -related lncRNAs, while the bottom-ranked lncRNAs in d i can be predicted as d i -disrelated lncRNAs at the same time.

Performance Evaluation Metrics
To evaluate the robustness and prediction performance of PMFILDA, in this section, the Leave One Out Cross-Validation (LOOCV) was implemented based on the experimentally verified lncRNA-disease associations. In LOOCV, each pair of known lncRNA-disease associations is used as a validation set, while other known lncRNA-disease associations are used as training sets. Moreover, all the lncRNA-disease pairs without experimentally verify are used as candidate samples. The ranking of the test sample relative to the candidate sample needs to be evaluated after the implementation of PMFILDA. When a threshold is given, if the test sample ranks above the given threshold, then we will regard that a correctly positive sample has been predicted by PMFILDA, otherwise we will regard that a correctly negative sample has been predicted by PMFILDA. Moreover, while different thresholds are set, a series of True Positive Rate (TPR) and False Positive Rate (FPR) can also be obtained according to the following formulas: where TP and TN denote the number of positive and negative samples that have been correctly identified, while FP and FN represent the number of positive and negative samples that have been incorrectly identified. Hence, the Receiver Operating Characteristic (ROC) curve can be drawn by plotting TPRs versus FPRs, and the area under ROC curve (AUC) can be further calculated to measure the global performance of PMFILDA. Obviously, the closer the value of AUC is to 1, the more robust the prediction model would be. Moreover, during simulation, to eliminate the random errors caused by the random initialization of U and V, we repeated our experiments 100 times and took the mean and variance of AUCs as our final results, which were shown in the following Figure 3. In addition, from Figure 3, it is easy to see that our newly proposed prediction model PMFILDA can achieve the mean AUC of 0.8794 and the standard deviation of 0.0011. Next, to further evaluate the performance of PMFILDA, based on the framework of LOOCV, we compared PMFILDA with some state-of-the-art models such as NBCLDA [34], HGLDA [23], and the method proposed by Yang et al. [35]. Similarly, during simulation, to eliminate the random errors caused by the random initialization of U and V, we repeated our experiments 50 times and took the mean of AUCs as our final results, which were shown in the following Table 1. In addition, while comparing PMFILDA with the NBCLDA, considering that we did not consider the genes in our method, then we compared PMFILDA with the NBCLDA_GN 1 only, and the simulation results are shown in Table 1 and the following Figure 4. Obviously, from Table 1 and Figure 4, it is easy to see that our newly proposed prediction model PMFILDA can achieve a reliable AUC of 0.8793 that is much higher than the AUC of 0.8519 achieved by NBCLDA_GN 1 . Moreover, while comparing PMFILDA with the HGLDA, in order to make a fair comparison, we implemented LOOCV on both PMFILDA and HGLDA based on the same dataset, i.e., we used the same 183 known lncRNA-disease associations proposed by HGLDA in the comparison simulation, and the simulation results are shown in Table 1 and the following Figure 5. Obviously, from Table 1 and Figure 5, it is easy to see that our newly proposed prediction model PMFILDA is superior to HGLDA.  Table 1 and the following Figure 6. Obviously, from Table 1 and Figure 6, it is easy to see that our newly proposed prediction model PMFILDA can achieve a reliable AUC of 0.9090 that is much higher than the AUC of 0.8568 achieved by Yang et al.

Contribution Analysis of lncRNA-Disease Associated Network
In our method, we constructed a weighted lncRNA-disease association networks based on the known lncRNA-disease, microRNA-disease and lncRNA-microNA association networks. It may be useful to discuss the contribution of lncRNA-disease associations separately here. Hence, without considering the relationship between lncRNA and disease, we constructed a weighted network of lncRNA-disease through using known lncRNA-microRNA associations and microRNA-disease associations only. Based on the steps in Section 3.2, we finally obtained 304 lncRNA-disease associations including 60 lncRNAs and 73 diseases. Thereafter, we further obtained the corresponding weight matrix W ld , and then performed LOOCV 100 times on the PMFILDA. The results were shown in the following Figures 7 and 8, obviously, the AUC value achieved by PMFILDA based on three association networks can be increased by 0.0763 than the AUC value achieved by PMFILDA based on two association networks.

The Effects of KNN on Performance
Considering that known lncRNA-disease associations are very sparse, there may exist some lncRNAs with no associations with any diseases, or some diseases with no associations with any lncRNAs. Hence, some potential associations between predicted lncRNAs and diseases will be invalid. Therefore, in this paper, we rebuilt the weight matrix W ld based on KNN algorithm to solve this kind of problem.
Here, we also investigated the influence of KNN algorithm on our method from two aspects. One is that we do not use KNN algorithm to deal with the weighted network G LDMN , the other is to update the weighted network G LDMN with other algorithms, such as K-means algorithm. When PMFILDA is directly executed without using KNN algorithm to process the weighted network G LDMN , the result is shown in Figure 9. It is easy to see that PMFILDA could achieve an average AUC of 0.8794 while the weight of G LDMN was reallocated; however, while the weight of G LDMN was not reallocated, PMFILDA can achieve an average AUC of 0.8042 only, which demonstrated that it can improve the performance of our model through adopting the KNN algorithm to re-allocate the weight of G LDMN . And in addition, to estimate the impacts of other algorithms, we selected the K-means algorithm for further testing. After performing LOOCV 100 times, we presented the simulation results in the following Table 2, and from observing the results in Table 2, it is easy to see that the performance of KNN is better than K-means.

Parameter Sensitivity Analysis
From above descriptions, it is easy to see that to improve the prediction performance of PMFILDA, some parameters have been introduced in the model construction of PMFILDA, whose values will need to be finalized by the training of the prediction model. For example, how to choose the value of the parameter K while adopting the algorithm of K-nearest neighbor? How to choose the attenuation coefficients α and β given in the Formulas (6) and (7)? How to choose the value of the parameter T while adopting Formula (1) to implement the matrix decomposition? and so on. Hence, firstly, to evaluate the impacts of the parameters K, α and β to the performance of our model PMFILDA, during simulation, we will set K from 2 to 10 and α from 0.1 to 0.9, respectively. Moreover, we will set α = β for convenience. The detailed simulation results were shown in the Supplementary Table S4. In addition, through experimental results, it is easy to know that our model PMFILDA can achieve the highest AUC of 0.9204 while K = 10 and α = β = 0.8 in the LOOCV framework. Next, to estimate the impacts of the parameter T to the performance of our model PMFILDA, during simulation, we will set T from 10 to 50 and the step size to 10. In addition, through experimental results, it is easy to know that our model PMFILDA can achieve the highest AUC of 0.9210 while T = 20 in the LOOCV framework.

Case Studies
In this section, we implemented case studies based on the optimal settings of above parameters to further verify the prediction performance of PMFILDA. During simulation, for each given disease, its potentially relevant lncRNAs predicted by PMFILDA will be sorted according to their predicted scores in descending order. In addition, as a result, the top 20 predicted lncRNAs related to the disease potentially will be recorded in the Supplementary Table S5, and then, two public databases such as MNDR V2.0 and LncRNA-Disease database will be used to confirm these potential associations between the given disease and each of these 20 predicted lncRNAs. In this section, we selected three kinds of common diseases such as breast cancer, lung cancer, and colorectal cancer as the targets of our case studies.
As for breast cancer, according to the reports of relevant literatures, it is very common in the group of women [36,37] and may be caused by a variety of molecular alterations. For example, studies have shown that the formation of breast tumors is closely related to lncRNA [38,39]. Hence, predicting breast cancer-associated lncRNA and identifying lncRNA markers are important for the diagnosis and treatment of breast cancer [39]. In this section, we will implement PMFILDA to discover the potential breast cancer-associated lncRNAs. In addition, as shown in the following Table 3, it is easy to see that 12 of the top 20 breast cancer-related lncRNAs predicted by our model PMFILDA have been confirmed in authoritative databases. For example, MALAT1, HOTAIR and H19 ranked the 1st, 2nd and 3rd in the list of our predicted results respectively, and among them, it is proved that MALAT1 has functional and prognostic significance as a metastasis driver in ER negative and lymph node negative breast cancer [40], HOTAIR will be overexpressed in approximately one quarter of human breast cancers and increased in expression in primary breast tumors and metastases [41], and the down-regulation of H19 will significantly reduce colony formation and anchorage-independent growth of breast and lung cancer cells [42].
Moreover, in recent years, lung cancer is a leading cause of cancer-related deaths worldwide, regardless of gender. According to the disease patterns and treatment strategies, it can be roughly divided into non-small cell lung cancer (NSCLC) and small cell lung cancer [43]. To diagnose and treat lung cancers more effectively, researchers have paid lots of attention to the deregulation of protein-coding genes in the past few decades to identify oncogenes and tumor suppressors [43][44][45]. However, recent studies have shown that lncRNAs play a significant role in the development and progression of lung cancers [43,45]. Hence, in this section, we will implement PMFILDA to infer the potential lung cancer-related lncRNAs. In addition, as illustrated in the following Table 3, it is easy to see that 14 of the top 20 potential lung cancer-related lncRNAs predicted by our model PMFILDA have been confirmed by authoritative biological experiments. For instance, MALAT1, HOTTIP and MEG3 ranked the 3rd, 4th and 5th in the list of our predicted results respectively, and among them, it is identified that MALAT1 is highly correlated with lung cancer metastasis [46,47], will promote lung cancer cell movement by regulating motor-related gene expression [48], and can be an important biomarker for the development of lung cancer metastasis [49]. Additionally, it is demonstrated that through knocking out HOXA13 by RNA interference (siHOXA13), HOTTIP can promote lung cell proliferation, migration, and inhibition of apoptosis, which could serve as a new biomarker and a therapeutic target for NSCLC intervention [50]. Moreover, as for MEG3, it is proved that the down-regulation of MEG3 will enhance cisplatin resistance of lung cancer cells through activation of the WNT/β-catenin signaling pathway [51]. Additionally, the colorectal cancer (CRC) has a high incidence in Western countries in recent years [52], and more and more research indicates that lncRNAs play a significant role in the formation of CRC [44,45]. Hence, in this section, we will implement PMFILDA to predict the potential CRC-related lncRNAs. In addition, as shown in the following Table 2, it is easy to see that eight of the top 20 CRC-related lncRNAs predicted by our model PMFILDA have been confirmed by authoritative biological experiments. For instance, MALAT1, NEAT1 and TUG1 ranked the 2nd, 8th and 10th in the list of our predicted results respectively, and among them, it is identified that MALAT1 may be a potential predictor of tumor metastasis and prognosis, and that the interaction between MALAT1 with SFPQ may be a new therapeutic target for CRC [53]. In addition, it is proved that NEAT1 can be used as an indicator of tumor recurrence and colorectal cancer prognosis [54], and the expression of NEAT1 in CRC may play a carcinogenic role in the differentiation, invasion, and metastasis of CRC, hence, the whole blood NEAT1 expression can be used as a new diagnostic and prognostic biomarker for overall survival in CRC [55]. Moreover, it is demonstrated that the tumor expression of TUG1 plays an important role in colorectal cancer metastasis, and TUG1 can be used as a biomarker or therapeutic target for potential CRC [56].

Discussion
Increasing research has shown that lncRNAs play a crucial role in the occurrence, formation, diagnosis, treatment, and prognosis of diseases. The discovery of complex disease-associated lncRNAs as biomarkers based on existing biological experiments is not only costly but also requires a large amount of clinical data. Therefore, it is a future trend to integrate potential biological data resources and use developed computers to develop efficient and accurate computational models to predict potential new disease-related lncRNAs. In this paper, we proposed a novel computational model called PMFILDA to predict potential disease-associated lncRNAs. In this model, we first integrated known lncRNA-miRNA associations, miRNA-disease associations, and a small number of known lncRNA-disease associations into a new weighted lncRNA-disease association network. Then, based on the newly constructed association network, through adopting the semantic similarity of the disease, the functional similarity of lncRNA and the KNN algorithm to update the weight network, an lncRNA-disease association matrix W ld can be obtained. Hence, through adopting the probability matrix decomposition scheme to decompose the matrix W ld into the feature matrix U of lncRNA and the feature matrix V of the disease, we can finally construct our model PMFILDA based on the two feature matrices to predict the potential associations between lncRNAs and diseases. Compared to existing state-of-the-art models, simulation results have demonstrated that our model PMFILDA has better prediction performance. Moreover, case studies of breast cancer, lung cancer and colorectal cancer also indicated that PMFILDA can be used as a superior computational model to predict potential lncRNA-disease associations. However, it is obvious that there are still some biases in our model. When we only use lncRNA-disease associations and regardless of any miRNAs, the performance of PMFILDA may be reduced. To illustrate this situation, we did the following experiment. After processing the data, we obtained 246 pairs of lncRNA-disease associations, including 44 lncRNAs, 68 diseases. Then we performed 100 LOOCVs on the PMFILDA method, and the average AUC value was 0.8111, and the standard deviation was 0.0073. When we used miRNA, the average AUC value was 0.8794 and the standard deviation was 0.0011. The reason for this difference is that when we don , t consider miRNAs, the information we use for lncRNA-disease may be incomplete. There may be some important associations that do not exist in the lncRNA-disease data set. When the miRNA node is added, these important relationships can be re-established. Therefore, in our model, we need to consider not only the lncRNA-disease relationship, but also the nodes that can improve the lncRNA-disease relationship.

Conclusions
In this study, our major contributions are as follows: Firstly, we constructed a novel weighted lncRNA-disease association network through integrating the known lncRNA-miRNA association network, the known miRNA-disease association network and the known lncRNA-disease association network. Secondly, based on the semantic similarity of disease and the similarity of lncRNA function, we adopted the KNN algorithm to update the newly constructed weighted lncRNA-disease association network. Thirdly, based on the probability matrix decomposition model, we proposed a novel computational model called PMFILDA to predict potential lncRNA-disease associations, which cannot only predict the potential associations between lncRNAs and disease contained in the experimentally validated lncRNA-disease associations, but also predict the potential associations of its elements in unknown datasets. To improve the efficiency of our model, in the future, we plan to integrate more intermediate nodes such as genes to update the weighted lncRNA-disease association network. In addition, we also believe that the results [25,[57][58][59][60][61][62][63] of the miRNA-disease association prediction field will promote the development of lncRNA-disease correlation prediction. Moreover, while studying the association prediction of lncRNA-disease, focusing on the research results in other fields will also broaden our horizons.