Prediction of Disease-related microRNAs through Integrating Attributes of microRNA Nodes and Multiple Kinds of Connecting Edges

Identifying disease-associated microRNAs (disease miRNAs) contributes to the understanding of disease pathogenesis. Most previous computational biology studies focused on multiple kinds of connecting edges of miRNAs and diseases, including miRNA–miRNA similarities, disease–disease similarities, and miRNA–disease associations. Few methods exploited the node attribute information related to miRNA family and cluster. The previous methods do not completely consider the sparsity of node attributes. Additionally, it is challenging to deeply integrate the node attributes of miRNAs and the similarities and associations related to miRNAs and diseases. In the present study, we propose a novel method, known as MDAPred, based on nonnegative matrix factorization to predict candidate disease miRNAs. MDAPred integrates the node attributes of miRNAs and the related similarities and associations of miRNAs and diseases. Since a miRNA is typically subordinate to a family or a cluster, the node attributes of miRNAs are sparse. Similarly, the data for miRNA and disease similarities are sparse. Projecting the miRNA and disease similarities and miRNA node attributes into a common low-dimensional space contributes to estimating miRNA-disease associations. Simultaneously, the possibility that a miRNA is associated with a disease depends on the miRNA’s neighbour information. Therefore, MDAPred deeply integrates projections of multiple kinds of connecting edges, projections of miRNAs node attributes, and neighbour information of miRNAs. The cross-validation results showed that MDAPred achieved superior performance compared to other state-of-the-art methods for predicting disease-miRNA associations. MDAPred can also retrieve more actual miRNA-disease associations at the top of prediction results, which is very important for biologists. Additionally, case studies of breast, lung, and pancreatic cancers further confirmed the ability of MDAPred to discover potential miRNA–disease associations.


Introduction
MicroRNAs (miRNAs) are small noncoding, single-stranded RNAs encoded by endogenous genes with a length of approximately 22-24 nucleotides [1][2][3][4]. MiRNAs play important regulatory roles by targeting messenger RNA for splicing or translational inhibition in animals and plants [5]. Increasing evidences shows that miRNAs are involved in the development and progression of many diseases [6][7][8][9]. Therefore, identifying the regulatory relationships between diseases and miRNAs can help researchers explore the pathogenesis of disease.
Early studies mainly used biological experiments to obtain high-accuracy experimental results that fundamentally proved the associations of miRNAs and diseases. However, experimental methods are when focusing on the top part of the prediction results, MDAPred method successfully retrieved more real disease miRNAs. The case studies of three cancers further confirmed the ability of MDAPred to discover potential miRNA-disease associations.

Evaluation Metrics
We used 5-fold cross-validation as an evaluation method for predicting the miRNA and disease association performances. We randomly divided the associations of all known disease miRNAs into five equal parts. Of these, 4 were training sets for training the models and the remaining one was used as a test set for evaluation. We regarded the association in the test set as a positive sample and association between all unobserved miRNAs and diseases as a negative sample. In our association prediction ranking, a higher ranking of positive samples indicated better prediction performance.
Using a model based on nonnegative matrix factorization, we obtained predicted scores for miRNAs and diseases and ranked them in descending order. In this descending order, a higher positive example indicated better the prediction performance. For a pair of known associated diseases and miRNAs, if the association prediction score obtained by the model is higher than the threshold δ we set, it is judged as a positive sample. Otherwise, if the predicted score of the counter example is lower than δ, the sample is judged as negative. By varying the size of threshold δ, the corresponding true-positive rate (TPR) and false-positive rate (FPR) can be obtained and are defined as follows, where TP is the number of positive samples, TN is the number of the negative samples, and FN is the number of positive samples misidentified as negative. Correspondingly, FP indicates the number of negative samples misidentified as positive. TPR indicates the proportion of positive samples correctly identified among the total positive samples, and FPR is the misidentified negative samples accounting for all negative samples. By changing the threshold δ, we can obtain different TPR and FPR values. These TPR and FPR were used to plot the receiver operating characteristic (ROC) curve. The overall predicted performance was evaluated by calculating the area under the ROC curve (AUC).
Since the ratio of the number of unobserved miRNA-disease associations (negative samples) to the number of known associations (positive samples) was 1:30, there was a serious class imbalance between the positive and negative samples. Therefore, we used the precision-recall (PR) curve, which is more convincing than the ROC curve [35], as another evaluation standard. Similarly, by changing the threshold, new precision and recall values can be obtained to draw the PR curve and the area of PR curve (AUPR) is calculated. The precision and recall values are defined as follows, Additionally, biologists typically select the top miRNA candidates in the prediction results to verify their associations with diseases through biological experiments. In the prediction results of the top k, a larger number of positive samples appear to indicate more valuable predictions. Therefore, we calculated the recall rate of the top k, which is the ratio of positive samples in top k relative to the total positive samples, as another criterion for evaluating disease and miRNA performance.
Currently, the data for miRNA and disease association showed that most diseases are only associated with a few miRNAs, leading to a lack of sufficient association data to evaluate prediction models. Therefore, we selected 15 common diseases from the database for cross-validation and simulation experiments, each with a well-characterized disease and typically associated with at least 80 miRNAs.
DMPred exploited nonnegative matrix factorization to predict candidate miRNAs and achieved better performance. You et al. proposed a method called PBMDA which inferred disease-related miRNA by exploiting the information of paths connecting miRNAs and disease. GSTRW is a prediction miRNA-disease association method based on random walk. Liu's method inferred potential candidate miRNAs by exploiting the network topology information. BNPMDA predicted disease-related miRNA based on hierarchical clustering. Figure 1 demonstrates the receiver operating characteristic (ROC) and precision-recall (PR) curves of MDAPred and the other five methods. As shown in Figure 1A and Table 1, MDAPred method achieved the best average performance (AUC = 0.964) among all 15 diseases that we considered. In particular, it outperformed DMPred by 3.1%, PBMDA by 9.1%, GSTRW by 15.8%, Liu's method by 6.0%, and BNPMDA by 12.5%. We also listed the AUC of all six methods on 15 well-characterized human disease (Table 1), MDAPred yielded the best performed for 13 of the common diseases. GSTRW used disease similarities and miRNA similarities when predicting the candidate miRNAs but did not consider the disease miRNA associations. Therefore, GSTRW showed the lowest performance. As shown in Figure 1A, the ROC curves of both BNMPDA and PBMDA overlapped. PBMDA using path information performed better than the BNMPDA using layer clustering. Liu's method achieved better results than the above two methods. Although these methods use different calculations, they make full use of the topology information of heterogeneous networks. DMPred based on nonnegative matrix factorization used network topology and the original features of miRNAs and diseases for predicting associations, which achieved a competitive prediction performance. MDAPred is also based on a nonnegative matrix algorithm. Unlike DMPred, this method considers not only node attributes but also uses projection to obtain the association prediction. Figure 1A and Table 1 show that MDAPred exhibited the best performance against 15 common diseases. As shown in Figure 1B, the average PR curve of the 15 common diseases of MDAPred was higher than that of the other five methods. The average AUC of MDAPred was 10.3% better than DMPred, 16.7% better than PBMDA, 38% better than GSTRW, 14% better than Liu's method, and 24.4% better than BNPMDA. Of the 15 common diseases, MDAPred showed the best performance in 14 of these diseases ( Table 2). A higher recall rate of the top k of miRNAs indicates that more true miRNAs associated with diseases are correctly identified. The top k average recall rate for 15 common diseases is shown in A higher recall rate of the top k of miRNAs indicates that more true miRNAs associated with diseases are correctly identified. The top k average recall rate for 15 common diseases is shown in Figure 2. Under the various top k, MDAPred method recall was significantly higher than those of the other methods. For the top 30, MDAPred method showed a recall rate of 0.641, the top 60 recall rate was 0.862, and the top 90 recall rate was 0.965. The recall rate of the top 30 for DMPred method was 0.448, for the top 60 was 0.675, and for the top 90 was 0.791. Most recall values determined using PBMDA were close to those obtained using Liu '  In addition, to further verify that the AUCs and AUPRs of MDAPred were significantly higher than those of other methods, we perform a paired t-test. All paired t-test results were less than 0.05, which indicates that MDAPred's performance was significantly better than that of other methods (Table 3).

Case Studies
To demonstrate the ability of MDAPred to discover high-quality candidate miRNAs, we conducted case studies of breast, pancreatic, and lung cancers. Because breast cancer is one of the most common cancers, we used it as an example to analyze its top 50 candidates in detail (Table 4).
Xie et al. used text mining techniques to extract the association between experimentally validated miRNAs and diseases [38]. These associations were further manually verified and have been incorporated into miRCancer database, which contains 632 cancer-associated 6323 miRNA-disease associations. dbDEMC [39] is a differentially expressed miRNA database in human cancers containing 2224 miRNAs differentially expressed in 36 cancers. As shown in Table 3, 39 of the 50 miRNA candidate genes were included in dbDEMC database and 21 candidates were In addition, to further verify that the AUCs and AUPRs of MDAPred were significantly higher than those of other methods, we perform a paired t-test. All paired t-test results were less than 0.05, which indicates that MDAPred's performance was significantly better than that of other methods (Table 3).

Case Studies
To demonstrate the ability of MDAPred to discover high-quality candidate miRNAs, we conducted case studies of breast, pancreatic, and lung cancers. Because breast cancer is one of the most common cancers, we used it as an example to analyze its top 50 candidates in detail (Table 4).
Xie et al. used text mining techniques to extract the association between experimentally validated miRNAs and diseases [38]. These associations were further manually verified and have been incorporated into miRCancer database, which contains 632 cancer-associated 6323 miRNA-disease associations. dbDEMC [39] is a differentially expressed miRNA database in human cancers containing 2224 miRNAs differentially expressed in 36 cancers. As shown in Table 3, 39 of the 50 miRNA candidate genes were included in dbDEMC database and 21 candidates were included in miRCancer database. This suggests that these miRNAs are abnormally expressed in breast cancer and are associated with breast cancer. hsa-mir-520e dbDEMC, PhenomiR, miRCancer 50 hsa-mir-3135b literature [44] PhenomiR database [45] contains miRNAs differentially expressed in diseased tissues compared to normal tissues. Twenty-six candidate miRNAs are present in PhenomiR database, indicating that they are upregulated or downregulated in breast cancer. Although hsa-mir-4480 [41] had a centrality score of 9. It was still described as a breast cancer-related miRNA in the SKBR3 network. Hsa-mir-885 [40] directly targets B7-H3 by association with the B7-H3 3 -UTR region, suggesting that hsa-mir-885 have a direct role in modulating B7-H3 protein expression in breast cancer. Chaluvally-Raghavan et al. [42] demonstrated that hsa-miR-569, which is overexpressed in a subset of ovarian and breast cancers, at least in part owing to the 3q26.2 amplicon, alters cell survival and proliferation. Xian Wang et al. [43] performed a differential expression profile analysis of hsa-mir-4454 in breast cancer cells. Junjun et al. [44] confirmed that hsa-mir-3135b is differentially expressed in the breast cancer cell line MCF7. Hsa-mir-6838 is marked "Unconfirmed" and thus not currently supported by the databases and the relevant literature.
Supplementary Table S1 lists the top 50 candidates associated with lung cancer. DbDEMC database contains 35 candidates showing abnormal expression in lung cancer, and 31 candidate miRNAs are present in miRCancer database, demonstrating their association with lung cancer disease. Thirty-seven candidate miRNAs are present in PhenomiR database, showing their expression levels significantly altered in lung cancer cells. NCIH460, a lung cancer cell line, was treated with a screening library, revealing the ability of hsa-mir-4480 [46] to inhibit the growth of lung cancer cells. Park et al. [47] showed that hsa-mir-1843 is significantly upregulated compared with normal lung tissue. Long noncoding RNA NEAT1 promotes non-small cell lung cancer progression through regulation of the hsa-mri-4262 pathway [48]. In addition, EZH2 and miR-4448 show mutual negative regulations for tumor progression via epithelial mesenchymal transition in small cell lung cancer [49]. Hsa-mir-3161 is listed as differentially expressed miRNAs in lung adenocarcinoma by Gou et al. [50]. Hsa-mir-3074-5p is also significantly correlated with small cell lung cancer metastasis [51].
For pancreatic cancer, the top 50 candidate associations are listed in Supplementary Table S2. Forty-eight and 18 candidates are present in dbDEMC and miRCancer databases, respectively, indicating that they are associated with the disease. Forty candidate miRNAs are present in the PhenomiR database, suggesting that the expression levels of this gene in pancreatic cancer cells significantly differ from those in normal tissues.
The data of disease and miRNA used herein was derived from the latest Human miRNA-Disease Database (HMDD, released in March 2019) [52], which contains 7908 miRNA-disease association pairs that have been validated by biological experiments. Disease terms from the American Medical Library (Mesh, hattp://www.ncbi.nlm.nih.gov/mesh) were used to construct directed acyclic graphs (DAGs) to calculate the semantic similarities of the disease. We obtained the disease phenotypic similarity [53] information from previous work. The information of 530 miRNA families is extracted from miRBase (version 22.1) [54]. According to previous studies, we obtained 1309 clusters by setting the distance between two miRNAs to no more than 20 kb.
The primary goal of the study was to predict disease-miRNA associations. To integrate miRNA similarities, disease similarities, miRNA-disease association, and miRNA node attributions, a model based on nonnegative matrix factorization was constructed (Figure 3), and then this model was solved with an iterative algorithm. This model can reveal association scores of miRNAs m i and diseases d j . A higher association score indicates a greater likelihood of an association.

Data Representation of miRNAs and Diseases
MiRNA similarities. It is well-known that miRNAs with similar functions are often associated with similar diseases. Wang et al. [19]  Disease similarities. From the dual perspectives of disease semantics and phenotypes (signs and symptoms), we measured the similarity of two diseases. Generally, we used a DAG to represent disease-related semantic terms. A larger number of common terms on the DAG for two diseases reflects greater similarity between the two diseases. If the two diseases have more common phenotypes, then the two diseases are more similar. Therefore, we quantified the similarities of diseases based on the semantics and phenotype of the disease (Figure 3b). Xuan et al. [21,23,31,55] successfully integrated this information and calculated the similarity of diseases, which we obtained from the previous method. The similarity matrix D = D ij ∈ N d ×N d containing N d diseases indicates the similarity of disease d i and disease d j ; a larger value indicates greater similar, and the value of D ij is generally between 0 and 1.
MiRNA-disease associations. According to the known associations between miRNAs and diseases, an associations matrix A = A ij ∈ N m ×N d was constructed (Figure 3c). Each row of the association matrix A corresponds to a miRNA, of which the column corresponds to a disease. If the miRNA m i is associated with a disease d j , then A ij = 1. If m i and d j are not associated or no association has been observed so far, then A ij = 0.
MiRNA node attributes. C ∈ N m ×(N f +N c ) is a miRNA family and cluster characteristic matrix, with the rows representing miRNAs and columns showing family or cluster information (Figure 3d). Vector C i represent miRNA m i subordinate to N f family and N c cluster, which are considered node attributes. C ij C i(N f +k) = 1 indicates that the miRNA belongs to the j th family or k th cluster; otherwise, the value is 0.

Prediction Models for Disease-miRNA Associations
A model based on nonnegative matrix factorization was constructed, which integrates miRNA similarities, disease similarities, miRNA and disease associations, as well as miRNA family and cluster information. Let U ∈ N m ×N d indicate the predicted miRNA associated score with the disease. N m is the number of miRNAs, N d is the number of diseases, and U ij is the score of the miRNA and disease association. A larger score means that m i and d j are more likely to be associated, and U ij is typically greater than or equal to 0.
Projection of miRNA, disease, and node attributes. We projected miRNA disease-related information into low-dimensional space to extract representative low-dimensional feature vectors. For the miRNA, M denotes the miRNA similarities matrix, which is projected into the c-dimensional space. X ∈ N m ×c is a projection matrix of miRNA similarities, MX ∈ N m ×c represents the low-dimensional feature matrix of the miRNAs, and the i th row of MX represents the low-dimensional feature vector about m i . For the disease, D is the similarities matrix of the disease, which can be projected into the low-dimensional space, and the low-dimensional feature matrix can be obtained. Y ∈ N d ×c is a projection matrix of disease similarities, DY ∈ N d ×c is the low-dimensional feature matrix of the disease, and the j th row of DY represents the low-dimensional feature vector about d j .
For the miRNA of the node attributes, C ∈ N m ×(N f +N d ) is the feature matrix of the family and cluster, which is projected into the low-dimensional space to obtain the low-dimensional feature matrix of the node attributes of the miRNA. Z ∈ (N f +N c )×c is the projection matrix of the node attributes. CZ ∈ N m ×c is a miRNA low-dimensional feature matrix with node attributes, and its i th row is a low-dimensional feature vector of the miRNA family and cluster.
Modelling miRNA-disease associations. In association matrix A, the values of all 1 represent the observed miRNA disease association, 0 indicates that an association has not been observed, and most values of 0 indicate that the miRNA is not associated with the disease. The association matrix A reflects the true associations between miRNAs and diseases. The element U ij in the score matrix U indicates the possibility that the miRNA is associated with a disease. The evaluated score matrix U should be as consistent as possible with the actual correlation. The objective function is obtained as follows, where · F is the Frobenius norm of a matrix. Modelling similarities of miRNAs and diseases. The i th row of the low-dimensional feature matrix (MX) ∈ N m ×c represents the feature vectors of the miRNA m i in the c-dimensional space. Similarly, the j th column of (DY) T ∈ c×N d represents the feature vector of the disease d j in the c-dimensional space. The closer the miRNA m i is to the disease d j in the c-dimensional space, i.e., the larger the value of (MX) i (DY) T j , the more likely m i is associated with d j . An element of the score matrix U ij denotes the probability that the predicted m i is associated with d j . U ij and (MX) i (DY) T j should be as consistent as possible. An objective function expansion was obtained as follows, where α 1 is a hyperparameter for adjusting the contribution of the second section.
Modelling node attributes of miRNAs. (CZ) i is the i th row of the matrix (CZ) ∈ N m ×c , which records the low-dimensional feature vector of m i based on the miRNA and node attribution. Correspondingly, (DY) T j is the i j th row of the matrix of (DY) T , which records the low-dimensional feature vector. The more consistent (CZ) i and (DY) T j , the more likely m i is associated with d j . U ij is the estimated association score of m i and d j . To make the predicted score matrix U and actual calculated association as consistent as possible, our objective function is expanded, where α 2 is the contribution of the adjustment node attribute information.
Modelling the topological structure of miRNAs. miRNAs and k neighbours are more likely to be associated with similar diseases. A graph model S based on similarity between miRNA and miRNA was created, S ij = 1, if miRNA m i is one of the similar neighbours of miRNA m j 0, otherwise .
The graph Laplacian matrix L of miRNA feature graph S is defined as follows, where W is a diagonal matrix with W(i, i) = N m j S(i, j). Graph models are used to introduce smooth regularization, as miRNA with similar features should have similar diseases. The graph model is used to reflect the correlation and similarity of known indications between different miRNAs. The objective function is expanded as follows, where α 3 is a hyperparameter that adjusts the contribution of the regularization of graphs to the entire objective function and Tr() represents the trac of the matrix.
Consider the sparseness of associations. Since a disease is only associated with a limited number of miRNAs, we imposed l 1 -regularization to U learn sparse associations. The objective function is expanded as follows,

Optimization
The objective function L(U, X, Y, Z) in Equation (9) is a non-convex function, and it is impractical to obtain its global optimal solution. We divided the function into four subproblems to obtain a near-optimal solution for L(U).
U-subproblem. When X, Y, and Z are fixed, the subproblem for solving U is as follows, According to the trace property and Frobenius norm of the matrix, L(U) can be rewritten as follows, where Tr() is the trace of the matrix. By setting the derivative of L(U) with respect to U to 0, we obtain the following equation, where B = B ij ∈ N m ×N d is a matrix of which the elements are all 1. By multiplying both sides of Equation (12) by U ij , we obtain the following equation Finally, according to the coordinate descent algorithm, we can obtain U ij 's updated formula by multiplying its current value with the ratio of the negative terms to the positive term of Equation (13), X-subproblem. When U, Y, and Z are fixed, the subproblem for solving X is, According to the trace property and Frobenius norm of the matrix, L(X) can be rewritten as, By setting the derivative of L(X) with respect to X to 0, we obtain the following equation, By multiplying both sides of Equation (17) by X ij , we obtain the following equation, X's updating rule by applying the coordinate gradient descent algorithm is as follows, Y-subproblem. When U, X, and Z are fixed, the subproblem for solving Y is as follows, We transformed the Frobenius norms of the matrices in L(Y) to their trace norms and rewrote L(Y) as follows, Similar to the process for solving the subproblems of U, X, and Y, L(Z) is transformed first according to the characteristic of the matrix traces. The derivative is then determined with respect to Z. Finally, the gradient descent algorithm is applied to obtained the updated rule for Z, The iterative process is over when the absolute difference of L(U, X, Y, Z) at two adjacent moments is less than a threshold (ε = 10 −6 ) or when the maximum number of iterations, 100, is reached. Finally, U ij is regarded as the estimated association score between miRNA m i and disease d j (Figure 4).

Conclusions
In the current study, MDAPred, a new method based on nonnegative matrix factorization, was developed for predicting potential disease-miRNA candidates. MDAPred deeply integrates the projections of multiple kinds of connecting edges and the node attributions of miRNAs to enhance the detection of the disease-miRNA associations. MDAPred also takes full advantage of information about the neighbours of miRNAs to capture the local topology of miRNAs. A sparse penalty was introduced to improve the performance of MDAPred. An iterative algorithm was proposed to obtain discriminative ability. MDAPred was superior to other tested methods not only in their AUCs but also in their AUPRs. Additionally, MDAPred is useful for biologists, as it can list more real disease-miRNA associations in its top ranking list. Case studies of three diseases revealed the ability of MDAPred to identify potential candidates. Therefore, MDAPred can serve as a prioritization tool for identifying real associations of disease miRNAs through wet-lab experiments.
Supplementary Materials: The following are available online. Author Contributions: P.X. and L.L. conceived the prediction method, and L.L. wrote the paper. Y.Z. and Y.S. developed the computer programs. P.X. and T.Z. analyzed the results and revised the paper.