Predicting miRNA-Disease Associations by Incorporating Projections in Low-Dimensional Space and Local Topological Information

Predicting the potential microRNA (miRNA) candidates associated with a disease helps in exploring the mechanisms of disease development. Most recent approaches have utilized heterogeneous information about miRNAs and diseases, including miRNA similarities, disease similarities, and miRNA-disease associations. However, these methods do not utilize the projections of miRNAs and diseases in a low-dimensional space. Thus, it is necessary to develop a method that can utilize the effective information in the low-dimensional space to predict potential disease-related miRNA candidates. We proposed a method based on non-negative matrix factorization, named DMAPred, to predict potential miRNA-disease associations. DMAPred exploits the similarities and associations of diseases and miRNAs, and it integrates local topological information of the miRNA network. The likelihood that a miRNA is associated with a disease also depends on their projections in low-dimensional space. Therefore, we project miRNAs and diseases into low-dimensional feature space to yield their low-dimensional and dense feature representations. Moreover, the sparse characteristic of miRNA-disease associations was introduced to make our predictive model more credible. DMAPred achieved superior performance for 15 well-characterized diseases with AUCs (area under the receiver operating characteristic curve) ranging from 0.860 to 0.973 and AUPRs (area under the precision-recall curve) ranging from 0.118 to 0.761. In addition, case studies on breast, prostatic, and lung neoplasms demonstrated the ability of DMAPred to discover potential disease-related miRNAs.


Introduction
Several studies have shown that the abnormal expression of microRNAs (miRNAs) is inextricably related to the occurrence and development of diseases [1][2][3][4][5]. As the number of identified miRNAs continues to increase, a large number of disease-related miRNAs (disease miRNAs) are waiting to be identified.
Previous methods for predicting disease-associated miRNAs can be divided into two categories. The first category uses the regulatory relationships between miRNAs and their target genes to predict potential associations between miRNAs and diseases [6]. Because the number of experimentally validated target genes is insufficient, prediction algorithms such as PITA [7], TargetScan [8], and MiRanda [9] are needed to infer target gene-miRNA associations [10][11][12][13]. The likelihood of a miRNA associated with a disease is predicted based on the similarity or interaction between disease-related target genes and miRNA-related target

Materials and Methods
Our aim was to predict potential miRNAs associated with diseases using the DMAPred method. First, a dual heterogeneous network composed of miRNA nodes and disease nodes was constructed to represent the multiple relationships between miRNAs and diseases. Then, a new prediction model based on non-negative matrix factorization was built to take into account the disease similarities, the miRNA similarities, and the associations between miRNAs and diseases. Finally, we obtained the prediction scores for each miRNA-disease pair by iterative optimization.

Dataset
The human miRNA-disease database (HMDD) collects a large number of experimentally confirmed associations between miRNAs and diseases [34]. We obtained 5088 known associations from HMDD, involving 490 miRNAs and 326 diseases. Disease terms were obtained from the National Library of Medicine (http://www.ncbi.nlm.nih.gov/mesh) to construct a directed acyclic graph (DAG) of diseases. The disease semantic similarity and phenotypic similarity were obtained from previous work [17].

Establishment of the miRNA-Disease Dual Heterogeneous Network
The dual heterogeneous network consisted of two types of nodes and three networks: the similarity network of miRNAs, the similarity network of diseases, and the bipartite network between miRNAs and diseases.
Establishment of the miRNA network: The miRNA network (MiNet) was established on the similarity between miRNAs (Figure 1a). If two miRNAs were similar, we put an edge between the two corresponding nodes. Every edge has a weight between 0 and 1 that indicates the similarity between the two miRNAs. Two miRNAs that have similar functions are usually associated with similar diseases. Wang et al. [18] calculated the similarity of miRNAs based on the similarity between the diseases that they were associated with. For example, if miRNA $m_i$ is associated with a group of diseases $P_i = \{d_3, d_4, d_6\}$ and miRNA $m_j$ is associated with a group of diseases $P_j = \{d_1, d_2, d_4, d_8\}$, the similarity between $m_i$ and $m_j$ is calculated based on the similarity of $P_i$ and $P_j$. The miRNA similarity that we used was calculated by Wang's method.
Establishment of the disease network: The disease network is built on the similarity of diseases (Figure 1b). Every node in the disease network indicates a disease. We added an edge between two corresponding nodes when the two diseases were similar. The weight of every edge is the similarity between the two diseases at its ends and is a positive number less than 1. The similarity between two diseases was estimated from disease semantics and phenotypes [20]. The more semantics and phenotypes two diseases have in common, the more similar they are, and therefore the more likely they are to be associated with similar miRNAs.
The matrix $D = [D_{ij}] \in \mathbb{R}^{N_d \times N_d}$ represents the disease network, where $D_{ij}$ is the similarity between the $i$th disease and the $j$th disease and takes a value between 0 and 1. The number of diseases in the disease network is $N_d$.
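The set-based miRNA similarity described above can be illustrated with a small sketch. It assumes a best-match-average scheme (each disease is matched to its most similar disease in the other group, and the matches are averaged), which is our reading of Wang's method rather than a formula given in this text. The disease similarity values and the helper names `best_match`, `mirna_similarity`, and `toy_sim` are hypothetical.

```python
def best_match(disease, group, sim):
    """Largest similarity between one disease and any disease in a group."""
    return max(sim[disease][other] for other in group)


def mirna_similarity(group_i, group_j, sim):
    """Best-match average over the two disease groups (assumed scheme)."""
    total = sum(best_match(d, group_j, sim) for d in group_i)
    total += sum(best_match(d, group_i, sim) for d in group_j)
    return total / (len(group_i) + len(group_j))


def toy_value(a, b):
    """Illustrative similarity: 1.0 same disease, 0.5 adjacent index, 0.1 otherwise."""
    gap = abs(a - b)
    return 1.0 if gap == 0 else 0.5 if gap == 1 else 0.1


# Hypothetical disease similarity table for d1..d8 (not real data).
toy_sim = {a: {b: toy_value(a, b) for b in range(1, 9)} for a in range(1, 9)}

P_i = [3, 4, 6]       # diseases associated with m_i in the example above
P_j = [1, 2, 4, 8]    # diseases associated with m_j in the example above
score = mirna_similarity(P_i, P_j, toy_sim)
print(round(score, 3))
```

The score is symmetric in the two groups and equals 1 when a group is compared with itself, which matches the requirement that edge weights in MiNet lie between 0 and 1.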
Establishment of the miRNA-disease bipartite network: A bipartite network that records the associations between diseases and miRNAs was constructed by adding edges between the two types of nodes (Figure 1c). This network differs from the other two networks in that it contains two types of nodes and each edge connects nodes of different types. If the known association data show that disease $d_j$ is associated with miRNA $m_i$, we add an edge between the corresponding nodes with a weight of 1. Otherwise, when an association between disease $d_j$ and miRNA $m_i$ has not been discovered or does not exist, there is no edge between the nodes.
The matrix $A = [A_{ij}] \in \mathbb{R}^{N_m \times N_d}$ was constructed to record the weight of each edge of the bipartite network. The $i$th row of $A$ contains the associations between miRNA $m_i$ and all the diseases, and the $j$th column of $A$ contains the associations between disease $d_j$ and all the miRNAs. $A_{ij}$ is 1 when $m_i$ has been observed to be associated with $d_j$ and 0 otherwise.
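As a minimal sketch of how such an association matrix could be assembled from a list of known pairs, consider the following. The identifiers `m1`, `d1`, and so on, and the function name `build_association_matrix`, are placeholders for illustration and do not come from HMDD.

```python
def build_association_matrix(pairs, mirnas, diseases):
    """Return A with A[i][j] = 1 if miRNA i is known to be associated with disease j."""
    row = {name: i for i, name in enumerate(mirnas)}
    col = {name: j for j, name in enumerate(diseases)}
    A = [[0] * len(diseases) for _ in mirnas]
    for mirna, disease in pairs:
        A[row[mirna]][col[disease]] = 1
    return A


# Placeholder names, not real database entries.
mirnas = ["m1", "m2", "m3"]
diseases = ["d1", "d2"]
known_pairs = [("m1", "d1"), ("m3", "d1"), ("m2", "d2")]

A = build_association_matrix(known_pairs, mirnas, diseases)
for name, row_values in zip(mirnas, A):
    print(name, row_values)
```

Rows index miRNAs and columns index diseases, so a column of `A` lists the known miRNAs of one disease.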

miRNA-Disease Association Prediction Model
The proposed prediction model integrates information from the three networks (namely, MiNet, DisNet, and MiDisNet). To describe it, we introduced a matrix $U = [U_{ij}] \in \mathbb{R}^{N_m \times N_d}$ that holds the association scores between the $N_m$ miRNAs and the $N_d$ diseases, where $U_{ij}$ is a non-negative number indicating the likelihood that $m_i$ is associated with $d_j$.
Modeling miRNA similarities: Three types of connections in MiDisNet can be used to construct the prediction model. The first type is the similarities between miRNAs in MiNet. Matrix $M$ describes the miRNA similarities; for example, the $i$th row of $M$ contains the similarities between $m_i$ and all the other miRNAs. Data representation often has a large impact on the performance of a model. Projecting high-dimensional information into a low-dimensional space reduces the redundant information in the original data and yields denser, lower-dimensional feature representations. Therefore, we projected the miRNA similarities into a low-dimensional space by non-negative matrix factorization. Suppose $M = [M_1, M_2, \cdots, M_{N_m}] \in \mathbb{R}^{N_m \times N_m}$ is the non-negative data matrix, where $M_i$ is the $i$th column of $M$ and represents the $N_m$-dimensional original feature representation of the $i$th miRNA. Let $W \in \mathbb{R}^{N_m \times k}$ be the base matrix and $H \in \mathbb{R}^{k \times N_m}$ be the new representation of the data in terms of the basis $W$, where $k$ is the required dimension:

$$M \approx WH$$

The product of $W$ and $H$ should approximate the original matrix well. Thus, we aimed to minimize the following objective function,

$$\min_{W \ge 0,\, H \ge 0} \|M - WH\|_F^2$$

where $\|\cdot\|_F$ is the Frobenius norm of the matrix.
Modeling disease similarities: The second type of connection is the similarities between diseases. The $j$th column of $D$ contains the similarities between $d_j$ and all the diseases. We projected the disease similarities into a low-dimensional space in the same way as for the miRNAs to obtain a new representation of the diseases. Suppose $D = [D_1, D_2, \cdots, D_{N_d}] \in \mathbb{R}^{N_d \times N_d}$, where each column is the original feature representation of a disease. Let $X \in \mathbb{R}^{N_d \times k}$ be the base matrix and $C \in \mathbb{R}^{k \times N_d}$ be the new representation of the diseases. The disease similarities are projected as follows,

$$D \approx XC$$

Genes 2019, 10, 685
Our aim was to find two matrices $X$ and $C$ whose product is as close as possible to the original matrix $D$. To measure this fit, we added a term to the loss function,

$$\min_{W, H, X, C \ge 0} \|M - WH\|_F^2 + \alpha \|D - XC\|_F^2$$

where $\alpha$ is a hyperparameter used to adjust the contribution of the disease similarity.
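To make the projection step concrete, the sketch below factorizes a small non-negative similarity matrix with the standard Lee-Seung multiplicative updates for the plain loss $\|M - WH\|_F^2$. This is a swapped-in textbook rule for the single-term problem, not the update rule of DMAPred, and the function name `nmf`, the iteration count, and the toy matrix are assumptions.

```python
import numpy as np


def nmf(M, k, n_iter=200, eps=1e-9, seed=0):
    """Approximate non-negative M (n x m) as W @ H with W (n x k), H (k x m)."""
    rng = np.random.default_rng(seed)
    n, m = M.shape
    W = rng.random((n, k))
    H = rng.random((k, m))
    for _ in range(n_iter):
        # Multiplicative updates keep every entry non-negative.
        H *= (W.T @ M) / (W.T @ W @ H + eps)
        W *= (M @ H.T) / (W @ H @ H.T + eps)
    return W, H


# Toy symmetric, non-negative "similarity" matrix of rank 2 (illustrative only).
rng = np.random.default_rng(1)
B = rng.random((6, 2))
M = B @ B.T

W, H = nmf(M, k=2)
error = np.linalg.norm(M - W @ H, "fro") ** 2
print(W.shape, H.shape, error)
```

Each column of `H` is then the $k$-dimensional representation of one miRNA; the same routine applied to $D$ would give $X$ and $C$ for the diseases.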
Modeling the miRNA-disease associations: The third type of connection is the association between miRNAs and diseases. The miRNA-disease connections are recorded in matrix $A$, in which each 1 represents an observed association. The matrix $A$ is very sparse because only a small number of associations have been observed, so our model only considered the known associations. $Y = [Y_{ij}] \in \mathbb{R}^{N_m \times N_d}$ was defined as an indicator matrix, with $Y_{ij} = 1$ if $A_{ij} = 1$ and $Y_{ij} = 0$ otherwise. The predicted scores for the associations between the $N_m$ miRNAs and the $N_d$ diseases are recorded in $U$. The estimated association scores should be as close as possible to the known associations. As a result, we extended the objective function,

$$\min_{U, W, H, X, C \ge 0} \|M - WH\|_F^2 + \alpha \|D - XC\|_F^2 + \beta \|Y \odot (U - A)\|_F^2$$

where $\odot$ is the multiplication of the corresponding elements of two matrices and $\beta$ is a hyperparameter.
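The effect of the indicator matrix can be seen in a short sketch: only entries with a known association contribute to the squared error, so scores given to unobserved pairs are not penalized by this term. The function name `masked_squared_error` and the numbers are illustrative assumptions.

```python
def masked_squared_error(U, A):
    """Sum of (Y * (U - A))^2 with Y[i][j] = 1 where A[i][j] == 1, else 0."""
    total = 0.0
    for u_row, a_row in zip(U, A):
        for u, a in zip(u_row, a_row):
            y = 1 if a == 1 else 0
            total += (y * (u - a)) ** 2
    return total


A = [[1, 0],
     [0, 1]]
U = [[0.8, 0.6],   # 0.6 is a score for an unobserved pair and is ignored
     [0.3, 1.0]]

loss = masked_squared_error(U, A)
print(loss)
```

Here only the entry scored 0.8 against a known association contributes, while the 0.6 and 0.3 given to unobserved pairs leave the loss unchanged.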
Modeling the characteristics in the low-dimensional space: $H \in \mathbb{R}^{k \times N_m}$ is the low-dimensional representation matrix of the $N_m$ miRNAs, whose $i$th column is $m_i$, and $C \in \mathbb{R}^{k \times N_d}$ is the low-dimensional feature matrix of the $N_d$ diseases, whose $j$th column is $d_j$. Here $m_i \in \mathbb{R}^k$ and $d_j \in \mathbb{R}^k$ denote the feature vectors of the $i$th miRNA and the $j$th disease, respectively. Our goal was to derive the association scores between miRNAs and diseases by updating $U$ under the model $U = H^T C$. Therefore, the loss function becomes,

$$\min_{U, W, H, X, C \ge 0} \|M - WH\|_F^2 + \alpha \|D - XC\|_F^2 + \beta \|Y \odot (U - A)\|_F^2 + \lambda \|U - H^T C\|_F^2$$

where $\lambda$ is a hyperparameter.
Considering the sparse characteristic of associations: Each miRNA is associated with only a small number of diseases. Hence, the miRNA-disease associations have a sparse characteristic. We added the 1-norm of $U$ as a term in the objective function so that the non-zero elements of $U$ remain sparse.
Modeling local topological information of the miRNAs: A miRNA and its $k$ nearest neighbors are usually associated with similar diseases. First, a graph model $S$ was constructed based on the similarities of the miRNAs, with $S_{jl}$ set to 1 when $m_l$ is one of the $k$ nearest neighbors of $m_j$. Let $u_j$ and $u_l$ denote the associations of miRNAs $m_j$ and $m_l$ with all the diseases, respectively. When $m_j$ and $m_l$ are neighbors, $u_j$ and $u_l$ should be as consistent as possible. The final loss function therefore adds a term that penalizes the distance $\|u_j - u_l\|$ between neighboring miRNAs, where $\|\cdot\|$ is the 2-norm; $\delta$ and $\eta$ weight the contribution of the corresponding terms in the formula.
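The neighbor graph and the consistency penalty can be sketched as follows. The code assumes that neighbors are chosen by largest similarity in $M$, that a miRNA is not its own neighbor, and that the penalty is the sum of squared distances over neighbor pairs; these choices, and the names `knn_graph` and `smoothness_penalty`, are assumptions for illustration.

```python
import numpy as np


def knn_graph(M, k):
    """S[j, l] = 1 if miRNA l is among the k most similar miRNAs to miRNA j."""
    sim = np.array(M, dtype=float)      # copy, so the input is not modified
    np.fill_diagonal(sim, -np.inf)      # exclude self-neighbors (assumption)
    n = sim.shape[0]
    S = np.zeros((n, n))
    for j in range(n):
        neighbors = np.argsort(-sim[j])[:k]
        S[j, neighbors] = 1.0
    return S


def smoothness_penalty(U, S):
    """Sum over neighbor pairs (j, l) of S[j, l] * ||u_j - u_l||^2."""
    total = 0.0
    for j, l in zip(*np.nonzero(S)):
        diff = U[j] - U[l]
        total += S[j, l] * float(diff @ diff)
    return total


# Toy similarities: miRNAs 0 and 1 are close, as are miRNAs 2 and 3.
M = np.array([[1.0, 0.9, 0.1, 0.2],
              [0.9, 1.0, 0.2, 0.1],
              [0.1, 0.2, 1.0, 0.8],
              [0.2, 0.1, 0.8, 1.0]])
S = knn_graph(M, k=1)

U_smooth = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
U_rough = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]])
print(smoothness_penalty(U_smooth, S), smoothness_penalty(U_rough, S))
```

Score rows that agree between neighbors give a zero penalty, whereas rows that disagree are penalized, which is the behavior the local topological term is meant to encourage.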

Optimization
The objective Function (7), denoted by $F$, is non-convex, so a global optimal solution cannot be guaranteed directly. We therefore proposed an iterative method that divides the problem of minimizing $F$ into five sub-problems, one for each of the matrices $U$, $W$, $H$, $X$, and $C$, and finds a local optimal solution for each sub-problem in turn. To derive the update rules, $F$ was rewritten according to the relationship between the trace and the Frobenius norm of a matrix, where $\mathrm{Tr}(\cdot)$ is the trace of a matrix, i.e., the sum of the values on its main diagonal, and $V \in \mathbb{R}^{N_m \times N_m}$ is a diagonal matrix. When updating $U$, the other four matrices $W$, $H$, $X$, and $C$ were fixed.
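The trace relationships used in this kind of rewriting can be checked numerically. The sketch verifies $\|R\|_F^2 = \mathrm{Tr}(R^T R)$ and, for a symmetric neighbor graph, $\sum_{j,l} S_{jl}\|u_j - u_l\|^2 = 2\,\mathrm{Tr}(U^T (V - S) U)$. Taking $V$ as the diagonal matrix of row sums of $S$ and symmetrizing $S$ are assumptions made for this check, since the element-wise definition of $V$ is not reproduced in the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Identity 1: squared Frobenius norm equals the trace of R^T R.
R = rng.random((5, 4))
frob_sq = np.linalg.norm(R, "fro") ** 2
trace_form = np.trace(R.T @ R)

# Identity 2: pairwise graph penalty in trace form (symmetric S assumed).
U = rng.random((5, 3))
S = (rng.random((5, 5)) > 0.5).astype(float)
S = np.maximum(S, S.T)            # symmetrize the toy graph
np.fill_diagonal(S, 0.0)
V = np.diag(S.sum(axis=1))        # assumed definition: row sums of S

pairwise = sum(S[j, l] * np.sum((U[j] - U[l]) ** 2)
               for j in range(5) for l in range(5))
graph_trace = 2.0 * np.trace(U.T @ (V - S) @ U)

print(np.isclose(frob_sq, trace_form), np.isclose(pairwise, graph_trace))
```

Writing every term as a trace makes the derivative with respect to each matrix straightforward to take, which is what the sub-problems rely on.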
U sub-problem: With $W$, $H$, $X$, and $C$ fixed, the derivative of the objective function with respect to $U$ was set to 0. After multiplying both sides of the resulting equation by $U_{ij}$, an update rule for $U$ was obtained. According to the gradient descent algorithm, this rule gives the local optimal solution of $U$ in the current situation.
H sub-problem: When the matrices $U$, $W$, $X$, and $C$ are fixed, the derivative of the objective function $F$ with respect to $H$ is set to 0, and the update rule for $H$ is obtained from the resulting equation in the same way.
The same method was used to find the formulas for updating $W$, $X$, and $C$. The remaining four matrices were fixed when updating each matrix.
The $j$th column of the final matrix $U$ contains the association scores between the $j$th disease and all miRNAs (Figure 2). The miRNAs not yet found to be associated with the disease were sorted according to their association scores in $U$. The higher a miRNA ranks in this ordered list, the more likely it is to be a potential miRNA associated with the disease.
Algorithm: The DMAPred algorithm for predicting the potential disease-related miRNA candidates based on non-negative matrix factorization.
Input: The disease similarity matrix $D$, the miRNA similarity matrix $M$, and the miRNA-disease association matrix $A$.
Output: The miRNA-disease association score matrix $U$, where $U_{ij}$ is the association score between miRNA $m_i$ and disease $d_j$.
1. Randomly initialize the five matrices $U$, $H$, $W$, $X$, and $C$, with every entry in the range [0, 1].
2. While $U$, $H$, $W$, $X$, and $C$ have not converged do
3. Fix the matrices $W$, $H$, $X$, and $C$, and update $U$ by its update rule.
4. Fix $U$, $W$, $X$, and $C$, and update $H$ by its update rule.
5. Fix $U$, $H$, $X$, and $C$, and update $W$ by its update rule.
6. Fix $U$, $H$, $W$, and $C$, and update $X$ by its update rule.
7. Fix $U$, $H$, $W$, and $X$, and update $C$ by its update rule.
8. End while
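Once a score matrix is available, the ranking step for one disease can be sketched as below. The score values, the miRNA names, and the function name `rank_candidates` are placeholders; the sketch only shows that miRNAs already known to be associated with the disease are left out and the rest are ordered by their score.

```python
def rank_candidates(U, A, disease_index, mirna_names):
    """Order miRNAs without a known association to one disease by descending score."""
    scored = [
        (U[i][disease_index], mirna_names[i])
        for i in range(len(mirna_names))
        if A[i][disease_index] == 0      # skip known associations
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [name for _, name in scored]


# Placeholder scores and associations (rows: miRNAs, columns: diseases).
mirna_names = ["m1", "m2", "m3", "m4"]
U = [[0.9, 0.1],
     [0.4, 0.8],
     [0.7, 0.2],
     [0.2, 0.6]]
A = [[1, 0],
     [0, 1],
     [0, 0],
     [0, 0]]

candidates = rank_candidates(U, A, disease_index=0, mirna_names=mirna_names)
print(candidates)
```

For the first disease, `m1` is excluded because its association is already known, and the remaining miRNAs are listed from the highest to the lowest score.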

Performance Evaluation
To evaluate the performance of the algorithm, we performed fivefold cross validation. All known associations between miRNAs and diseases were randomly divided into five subsets. Each time, four subsets were used to train the model and the remaining one was used as the test set. For a disease $d_j$, the miRNAs associated with $d_j$ are considered positive samples, and the unlabeled miRNAs not known to be associated with $d_j$ are considered negative samples. The higher the positive samples are ranked, the better the prediction performance of the algorithm.
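A minimal sketch of such a fivefold split over known association pairs is given below. The shuffling seed, the round-robin assignment to folds, and the function name `five_fold_splits` are assumptions; the text does not describe how the random division was implemented.

```python
import random


def five_fold_splits(pairs, n_folds=5, seed=0):
    """Yield (train, test) lists so that every pair is tested exactly once."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::n_folds] for i in range(n_folds)]
    for i in range(n_folds):
        test = folds[i]
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        yield train, test


# Placeholder (miRNA index, disease index) pairs standing in for known associations.
known_pairs = [(m, d) for m in range(6) for d in range(2)]

splits = list(five_fold_splits(known_pairs))
for train, test in splits:
    print(len(train), len(test))
```

In each round the held-out associations would be hidden from the association matrix before training, so that they can be ranked among the unlabeled miRNAs of their disease.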
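Given a threshold $\theta$, a sample whose predicted association score is greater than $\theta$ is judged to be positive; otherwise it is judged to be negative. The true positive rate (TPR) and false positive rate (FPR) are calculated according to the following formulas,

$$\mathrm{TPR} = \frac{TP}{TP + FN}, \qquad \mathrm{FPR} = \frac{FP}{FP + TN}$$

where $TP$ and $FN$ are the numbers of positive samples judged to be positive and negative, respectively, and $FP$ and $TN$ are the numbers of negative samples judged to be positive and negative, respectively. The ROC curve is obtained by plotting the TPR against the FPR as $\theta$ varies, and the AUC is the area under this curve. The larger the AUC, the better the comprehensive prediction performance.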
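The threshold rule and the area under the ROC curve can be computed directly, as in the sketch below. The scores and labels are made-up values, the helper names are assumptions, and the area is obtained with the trapezoidal rule over the thresholds that occur in the data.

```python
def rates_at_threshold(scores, labels, theta):
    """Return (TPR, FPR) when scores greater than theta are judged positive."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted_positive = score > theta
        if label == 1 and predicted_positive:
            tp += 1
        elif label == 1:
            fn += 1
        elif predicted_positive:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), fp / (fp + tn)


def roc_auc(scores, labels):
    """Area under the ROC curve by the trapezoidal rule."""
    thresholds = sorted(set(scores), reverse=True)
    thresholds.append(min(scores) - 1.0)   # below every score: all judged positive
    points = [rates_at_threshold(scores, labels, t) for t in thresholds]
    area = 0.0
    for (tpr0, fpr0), (tpr1, fpr1) in zip(points, points[1:]):
        area += (fpr1 - fpr0) * (tpr0 + tpr1) / 2.0
    return area


# Made-up scores; label 1 marks a known association.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 0, 1, 0, 0, 0]

tpr, fpr = rates_at_threshold(scores, labels, theta=0.5)
auc = roc_auc(scores, labels)
print(tpr, fpr, auc)
```

The AUC equals the fraction of positive-negative pairs in which the positive sample receives the higher score, which is 7 of 8 pairs in this toy example.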
In the miRNA-disease association data, the number of known associations was much smaller than the number of unknown associations, which created a serious imbalance between the positive and negative samples. When the positive and negative samples are imbalanced, precision and recall are more suitable for measuring the performance of a method. The precision $P$ and the recall $R$ are defined as,

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}$$
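A small synthetic example shows why precision is informative under imbalance. With 2 positives among 100 samples, a prediction that places 8 negatives above the threshold still has a low false positive rate, yet only a fifth of the predicted positives are correct. All numbers and the helper names are illustrative assumptions.

```python
def confusion_counts(scores, labels, theta):
    """Return (TP, FP, FN, TN) when scores greater than theta are judged positive."""
    tp = sum(1 for s, y in zip(scores, labels) if s > theta and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s > theta and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s <= theta and y == 1)
    tn = sum(1 for s, y in zip(scores, labels) if s <= theta and y == 0)
    return tp, fp, fn, tn


def precision_recall(scores, labels, theta):
    tp, fp, fn, _ = confusion_counts(scores, labels, theta)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn)
    return precision, recall


# Synthetic imbalanced data: 2 positives, 98 negatives.
scores = [0.9, 0.8] + [0.95] * 8 + [0.1] * 90
labels = [1, 1] + [0] * 98

precision, recall = precision_recall(scores, labels, theta=0.5)
tp, fp, fn, tn = confusion_counts(scores, labels, theta=0.5)
false_positive_rate = fp / (fp + tn)
print(precision, recall, round(false_positive_rate, 3))
```

The false positive rate stays below 0.1 because the negatives are so numerous, while the precision of 0.2 exposes the weak ranking.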

Comparison with Other Methods
To confirm that the proposed method performs better in predicting potential miRNA candidates associated with diseases, we compared DMAPred with Liu's method [22], DMPred [35], PBMDA [29], GSTRW [36], and BNPMDA [37], which are state-of-the-art prediction methods for miRNA-disease associations. Liu et al. integrated the similarities and associations between miRNAs and diseases and proposed a method based on random walk with restart in a heterogeneous miRNA-disease network to predict the association score between a miRNA and a disease. You et al. proposed a path-based method, PBMDA, to predict the likelihood that a miRNA is associated with a disease. This method integrates not only the functional similarity of miRNAs and the semantic similarity of diseases, but also the Gaussian interaction profile kernel similarity of miRNAs and diseases. Xuan et al. proposed DMPred, based on non-negative matrix factorization, to predict the associations between miRNAs and diseases while taking into account the sparse nature of miRNA-disease associations. Chen et al. proposed a method, called GSTRW, that calculates the global similarity of a network and predicts the association between a miRNA and a disease by performing random walks in the miRNA and disease similarity networks, respectively. BNPMDA uses a bipartite recommendation algorithm to predict potential disease-associated miRNAs by assigning bias ratings to the associations between miRNAs and diseases.
Several hyperparameters in the objective function might affect the performance of the proposed algorithm. To examine the sensitivity of the algorithm to each parameter, we selected the values of the parameters α, β, λ, δ, and η from {0.1, 0.4, 0.8, 1, 4, 8}. The contribution of each parameter was measured by varying that parameter and comparing the resulting AUC values. Finally, we set the parameters to α = 0.1, β = 0.1, λ = 0.1, δ = 1, and η = 0.4.
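One way to organize such a one-parameter-at-a-time sweep is sketched below. The evaluation function `toy_auc` is a synthetic stand-in with an arbitrary peak, not the cross-validated AUC of DMAPred, and the starting values, the sweep order, and the names `sweep`, `GRID`, and `PEAK` are assumptions.

```python
import math

GRID = [0.1, 0.4, 0.8, 1, 4, 8]
NAMES = ["alpha", "beta", "lam", "delta", "eta"]


def sweep(evaluate, start):
    """Vary one parameter over GRID at a time, keeping the best value found."""
    best = dict(start)
    for name in NAMES:
        scored = [(evaluate({**best, name: value}), value) for value in GRID]
        best[name] = max(scored)[1]
    return best


# Synthetic stand-in for "train the model and return the average AUC".
PEAK = {"alpha": 0.4, "beta": 0.1, "lam": 0.8, "delta": 4, "eta": 1}  # arbitrary


def toy_auc(params):
    distance = sum(abs(math.log(params[n] / PEAK[n])) for n in NAMES)
    return 0.9 - 0.01 * distance


chosen = sweep(toy_auc, start={name: 1 for name in NAMES})
print(chosen)
```

In practice `evaluate` would run the cross validation for one parameter setting, so each sweep costs one training run per grid value and parameter.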
The predictive performances of the proposed method and of Liu's method, DMPred, GSTRW, PBMDA, and BNPMDA were compared for all the diseases based on different evaluation criteria. Figure 3a shows the average ROC curves for DMAPred and the other five methods over the 326 diseases. The average AUC values obtained with DMAPred, Liu's method, DMPred, GSTRW, PBMDA, and BNPMDA were 0.927, 0.859, 0.901, 0.810, 0.834, and 0.823, respectively. The proposed method, DMAPred, achieved the best performance, with an average AUC higher than those obtained using the other five methods by 6.8%, 2.6%, 11.7%, 9.3%, and 10.4%, respectively. The faster the TPR grows relative to the FPR, the larger the AUC of the corresponding ROC curve. The growth rate of the TPR is in turn affected by the predicted association scores of the positive samples: the larger these scores are, the closer the prediction results are to the actual values and the faster the TPR grows. DMPred achieved the second-best performance overall. Like our method, it is based on matrix factorization, although its calculation of disease similarity and miRNA similarity takes into account factors different from ours. Liu's method was slightly worse than DMPred, mainly because its similarity between miRNAs is measured indirectly through genes and lncRNAs and does not take into account the direct relationship between miRNAs and diseases. GSTRW performed worst among the compared methods, probably because it uses a two-layer random walk. We also list the AUCs for 15 well-characterized diseases associated with at least 80 miRNAs (Table 1). DMAPred achieved the best predictive performance for 10 of the 15 well-characterized diseases.
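The reported margins follow from the average AUC values by subtraction, and the short check below confirms this. It treats the percentages as absolute differences in AUC multiplied by 100, which is our reading of the figures.

```python
# Average AUC values as reported in the text.
auc = {
    "DMAPred": 0.927,
    "Liu's method": 0.859,
    "DMPred": 0.901,
    "GSTRW": 0.810,
    "PBMDA": 0.834,
    "BNPMDA": 0.823,
}

# Margin of DMAPred over each other method, in percentage points.
margins = {
    name: round((auc["DMAPred"] - value) * 100, 1)
    for name, value in auc.items()
    if name != "DMAPred"
}
for name, margin in margins.items():
    print(name, margin)
```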
The PR curve reflects the predictive performance of different methods better than the ROC curve when the positive and negative examples in the data set are unbalanced. Figure 3b shows the PR curves for DMAPred and the other five methods, with average AUPRs of 0.445, 0.389, 0.349, 0.193, 0.334, and 0.346 for the 326 diseases. DMAPred performed best and GSTRW performed worst. The AUPR of DMAPred was 5.6%, 9.6%, 25.2%, 11.3%, and 9.9% higher than those of the other methods. Table 2 shows the AUPR values of DMAPred and the other five methods for 15 diseases. DMAPred achieved the best performance for 10 of the 15 diseases. In addition, we conducted paired t-tests to further assess whether our method was superior to the others in terms of AUC and AUPR. All p-values of the paired t-tests were less than 0.05, indicating that our method was significantly better than the other methods (Table 3).
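For readers unfamiliar with the paired t-test, the statistic compares the per-disease differences between two methods with their spread. The sketch below computes only the t statistic with the standard library; the per-disease AUC values are invented for illustration, and the p-value would be read from a t distribution with n - 1 degrees of freedom.

```python
import math
import statistics


def paired_t_statistic(ours, theirs):
    """t = mean(d) / (stdev(d) / sqrt(n)) for the paired differences d."""
    diffs = [a - b for a, b in zip(ours, theirs)]
    n = len(diffs)
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))


# Invented per-disease AUCs for two methods (not results from the paper).
ours = [0.90, 0.85, 0.92, 0.88]
theirs = [0.86, 0.83, 0.87, 0.85]

t_value = paired_t_statistic(ours, theirs)
print(round(t_value, 3))
```

Pairing by disease removes the variation between diseases, so a consistent small advantage can be significant even when the absolute AUCs differ widely across diseases.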
with 59.19% in the top 30 candidates, 84.67% in the top 60, and 94.88% in the top 90. DMPred achieved the second-best performance, with 56.76% in the top 30 candidates, 79.82% in the top 60, and 91.68% in the top 90. Liu's method was slightly worse, with 50.01% in the top 30 candidates, 70.52% in the top 60, and 81.84% in the top 90. PBMDA reached 50.11% in the top 30 candidates, 70.14% in the top 60, and 79.49% in the top 90. GSTRW was the worst, with recalls of 26.90%, 57.79%, and 75.89%, respectively.
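The recall within the top k candidates can be computed as in the following sketch: it is the fraction of held-out positive miRNAs that appear among the first k entries of the ranked list. The ranked list and the set of positives are placeholders, and `recall_at_k` is a hypothetical helper name.

```python
def recall_at_k(ranked, positives, k):
    """Fraction of the positive miRNAs found among the top k ranked candidates."""
    hits = sum(1 for mirna in ranked[:k] if mirna in positives)
    return hits / len(positives)


# Placeholder ranking for one disease and its held-out positive miRNAs.
ranked = ["m5", "m2", "m9", "m1", "m7", "m3"]
positives = {"m2", "m1", "m3"}

for k in (2, 4, 6):
    print(k, round(recall_at_k(ranked, positives, k), 3))
```

A higher recall at small k means that more true associations sit near the top of the list, which is where candidates are most likely to be chosen for experimental follow-up.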

Case Studies on Breast Neoplasms, Prostatic Neoplasms, and Lung Neoplasms
To further demonstrate the ability of our approach to identify potential disease-related miRNAs, we conducted case studies on the top 50 candidates for breast neoplasms, prostate neoplasms, and lung neoplasms. The top 50 candidates related to breast neoplasms are listed for detailed analysis and verification (Table 4). The databases involved were dbDEMC [44] and PhenomiR [45]. dbDEMC is a publicly available online database that contains 807 miRNAs with significantly abnormal expression levels in human cancers. The PhenomiR database contains information on miRNAs that are differentially regulated in diseases, and its data were extracted from more than 365 scientific articles. Using the dbDEMC database, we found that 42 of the 50 candidates were up-regulated or down-regulated in breast neoplasms. Thirty-five of the 50 miRNA candidates were included in PhenomiR. The remaining five miRNAs, labeled 'Literature', were supported by the relevant research literature.
The top 50 candidates associated with prostate neoplasms are listed in supplementary table ST1. Abnormal expression of 39 candidates in prostate neoplasms was recorded in the dbDEMC2 database, and 36 candidates were included in the PhenomiR database. Three candidates marked 'Literature' were supported by the relevant literature. Several candidates labeled 'Unconfirm' were predicted to be associated with prostate neoplasms but had no support from a relevant database or the literature.
The top 50 candidates associated with lung neoplasms are shown in supplementary table ST2. Up-regulation or down-regulation of 29 candidates in lung neoplasms was recorded in the dbDEMC2 database, and seven candidates were confirmed by the relevant literature. The PhenomiR database included abnormal regulation of 17 candidates in lung neoplasms. The analysis of the predictions for breast neoplasms, prostate neoplasms, and lung neoplasms further demonstrates the ability of our method to predict disease-associated miRNAs.

Conclusions
A method based on non-negative matrix factorization, DMAPred, was developed to predict potential miRNAs associated with diseases. DMAPred captures the internal relationships of miRNAs and of diseases, namely the miRNA similarities and disease similarities, as well as the relationship between miRNAs and diseases, i.e., the miRNA-disease associations. Moreover, the local topological information of each node in the miRNA network and the dense features of miRNAs and diseases in low-dimensional space also contribute to the screening of potential disease miRNA candidates. The optimization problem was divided into five sub-problems, and an iterative algorithm was developed to obtain the final miRNA-disease association scores, which can be used to rank the candidate miRNAs for each disease. In our experiments, DMAPred was superior to several other methods with regard to both AUC and AUPR. In addition, DMAPred can help biologists find the candidates they are interested in because the top of its ranking list contains more true miRNA-disease associations. Case studies on three diseases confirmed that DMAPred is able to discover potential miRNA candidates associated with specific diseases.
Supplementary Materials: The following are available online at http://www.mdpi.com/2073-4425/10/9/685/s1.
Author Contributions: P.X. and Y.Z. conceived the prediction method, and Y.Z. wrote the paper. L.L. and L.Z. developed the computer programs. P.X. and T.Z. analyzed the results and revised the paper.