Predicting MiRNA-Disease Association by Latent Feature Extraction with Positive Samples

In discovering disease etiology and pathogenesis, the associations between microRNAs (miRNAs) and diseases play a critical role. Given known miRNA-disease associations (MDAs), uncovering potential MDAs is an important problem. Most existing methods regard known MDAs as positive samples and unknown ones as negative samples, and then predict possible MDAs by iteratively revising the negative samples. However, simply viewing unknown MDAs as negative samples introduces erroneous information, which may result in poor prediction performance. To avoid this defect, we present a novel method that uses only positive samples to predict MDAs by latent feature extraction (LFEMDA). We design a new approach to construct the miRNA similarity matrix. LFEMDA integrates the disease similarity matrix, the known MDAs and the miRNA similarity matrix to identify potential MDAs. By introducing miRNA and disease knowledge as auxiliary variables, the method can obtain the optimal solution in each iteration. We conduct experiments on high-association disease and new disease datasets, in which our method shows better performance than other methods. We also carry out a case study on breast neoplasms to further demonstrate the capacity of our method in uncovering potential MDAs.


Introduction
MicroRNAs (miRNAs), a class of small endogenous non-coding RNAs, regulate gene expression at the post-transcriptional level through mRNA degradation or translational inhibition [1][2][3]. There is growing evidence that miRNAs are essential in biological processes including immunoreaction, transcription, proliferation, differentiation, signal transduction and embryonic development [4][5][6][7][8][9]. Mutation, abnormal biosynthesis and dysfunction of miRNAs or of their targets can lead to various diseases [10][11][12][13]. Therefore, it is very important to identify the associations between miRNAs and diseases. Early studies determined the relationship between miRNAs and specific diseases via biological experiments. However, biological experiments have long cycles and high costs. Therefore, computational methods for analyzing and predicting the associations between miRNAs and diseases have been receiving great attention.
Currently, methods for predicting associations between miRNAs and diseases fall into two main categories: those based on network topology and those based on machine learning. Network topology methods are based on the observation that diseases regulated by functionally similar miRNAs are similar, and vice versa. [...] L-1 norm constraint. This model is promising in that feature selection dramatically reduces the dimensionality, and thus enables easy extension to higher dimensional datasets. However, these methods are mainly limited by the feature representation of miRNAs and diseases. You et al. [29] proposed a novel path-based method for MDA prediction (PBMDA). In addition to the two conventional similarity measures and the known associations, a Gaussian interaction profile kernel is further introduced to measure the similarity between miRNAs and diseases. With all four statistics, a heterogeneous graph consisting of three interlinked sub-graphs is constructed, and then the depth-first search algorithm is applied to it to infer potential MDAs. Algorithms based on matrix factorization solve the problem of feature representation by using high-dimensional space vectors. They construct the representations of miRNAs and diseases in a high-dimensional space at the same time, and then obtain their association. The probability of the final miRNA-disease association is solved by the least squares method. Such an idea is derived from the widely adopted method of matrix factorization in recommendation systems, which in recent years has proved very effective for association prediction problems. Shen [30] first proposed a matrix factorization based method (CMFMDA) to predict miRNA-disease associations in 2017. This approach achieved better performance than that of Chen [27]. However, due to the form of its loss function, the least squares method cannot be used in the iteration process, so the result depends to a large extent on the initial value.
In many cases, it is difficult to guarantee the stability of the algorithm because it may not converge. Besides, approaches based on matrix factorization regard unlabeled associations as negative samples. Thus, they extract wrong information, which biases the results. A recent study by Chen et al. [31] presents the first decision tree learning-based model for MDA prediction (EGBMMDA). Following the routine of integrating the miRNA functional similarity, the disease semantic similarity, and known MDAs, the model uses statistical measures, graph theoretical measures, and each miRNA-disease pair's matrix factorization result to form an informative feature vector. With the calculated feature vectors and known associated pairs, a regression tree is trained under the gradient boosting framework, which is then used for predicting potential MDAs.
This paper proposes a novel approach called miRNA-disease association prediction using latent feature extraction with positive samples (LFEMDA). First, we design a new method for constructing miRNA functional similarity. It addresses the problem that miRNA functional similarity is used to predict miRNA-disease associations while itself sometimes being derived from those associations, a circular dependence that is undesirable in common inference models. Second, LFEMDA introduces miRNA and disease knowledge as auxiliary variables so that the optimal solution can be obtained in each iteration of the optimization process. Third, LFEMDA uses only positive samples for feature extraction, which reduces the deviation. Finally, LFEMDA achieves strong results on both the high-association disease data and the new disease data.

Disease Semantic Similarity Network
The disease semantic similarity is calculated by the method of Wang [32], which depends on their common semantic annotations and shared disease symptoms. Every disease can be represented by a directed acyclic graph (DAG), and a disease D is denoted as $DAG(D) = (T_D, E_D)$, where $T_D$ is the set that includes all the ancestor nodes of D and D itself, and $E_D$ is the set of direct linking edges. The semantic contribution of a node t ($t \in T_D$) to disease D is defined as follows:

$$D_D(t) = \begin{cases} 1, & t = D \\ \max\{\Delta \cdot D_D(t') \mid t' \in \text{children of } t\}, & t \neq D \end{cases}$$

where $\Delta$ is a semantic contribution factor. We set it to 0.5, as suggested in literature [24]. The semantic value of D, $DV(D)$, is defined as:

$$DV(D) = \sum_{t \in T_D} D_D(t)$$

The more DAG terms two diseases share, the more similar they are. So the semantic similarity between disease $d_1$ and disease $d_2$ is defined as follows:

$$DS(d_1, d_2) = \frac{\sum_{t \in T_{d_1} \cap T_{d_2}} \left( D_{d_1}(t) + D_{d_2}(t) \right)}{DV(d_1) + DV(d_2)}$$
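As an illustration of the DAG-based similarity described above, the following is a minimal Python sketch. It assumes the usual reading of this scheme: a disease contributes 1 to itself, each step up to a parent multiplies the contribution by the factor 0.5, the largest value is kept when several paths reach the same ancestor, and the similarity is the shared contributions divided by the sum of both semantic values. The `parents` dictionary and the node names are hypothetical toy data, not taken from any disease ontology.

```python
def semantic_contributions(disease, parents, delta=0.5):
    """Return {t: contribution of t to `disease`} for the disease and all its ancestors."""
    contrib = {disease: 1.0}
    frontier = [disease]
    while frontier:
        node = frontier.pop()
        for parent in parents.get(node, ()):
            value = delta * contrib[node]
            # Keep the largest contribution when several paths reach the same ancestor.
            if value > contrib.get(parent, 0.0):
                contrib[parent] = value
                frontier.append(parent)
    return contrib


def semantic_similarity(d1, d2, parents, delta=0.5):
    """Shared contributions of d1 and d2 divided by the sum of their semantic values."""
    c1 = semantic_contributions(d1, parents, delta)
    c2 = semantic_contributions(d2, parents, delta)
    shared = set(c1) & set(c2)
    numerator = sum(c1[t] + c2[t] for t in shared)
    return numerator / (sum(c1.values()) + sum(c2.values()))


# Hypothetical toy DAG: child -> list of parents.
toy_parents = {"A": ["root"], "B": ["root"], "A1": ["A"]}
print(semantic_similarity("A", "B", toy_parents))
print(semantic_similarity("A1", "A", toy_parents))
```

Two sibling diseases share only their common ancestor, so their similarity is lower than that of a disease and its direct child, which share most of their DAG.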

miRNAs-Disease Association Network
We obtain the miRNA-disease association information from the HMDD database [33]. The data contain 10,381 experimentally confirmed associations among 378 diseases and 571 miRNAs. Matrix R represents the associations between miRNAs and diseases: if miRNA $m_i$ and disease $d_j$ are related, the value $R(m_i, d_j)$ is 1, and 0 otherwise.
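The construction of the binary association matrix can be sketched as follows. This is only an illustration: the function name, the identifiers and the pairs are hypothetical toy values, not records from HMDD.

```python
def build_association_matrix(pairs, mirnas, diseases):
    """Return R as a list of rows, with R[i][j] = 1 if (mirnas[i], diseases[j]) is a known pair."""
    row = {name: i for i, name in enumerate(mirnas)}
    col = {name: j for j, name in enumerate(diseases)}
    R = [[0] * len(diseases) for _ in mirnas]
    for mirna, disease in pairs:
        R[row[mirna]][col[disease]] = 1
    return R


# Hypothetical toy input.
toy_mirnas = ["mir-a", "mir-b"]
toy_diseases = ["disease-x", "disease-y"]
toy_pairs = [("mir-a", "disease-x"), ("mir-b", "disease-x"), ("mir-b", "disease-y")]
R_toy = build_association_matrix(toy_pairs, toy_mirnas, toy_diseases)
```

Entries left at 0 are unknown rather than confirmed negatives, which is why the method described in this paper fits only the entries equal to 1.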

miRNAs Functional Similarity Network
Based on the assumption that miRNAs with similar functions are involved in similar diseases, Wang et al. [32] gave a method to obtain the functional similarity between two miRNAs by calculating the similarity between the two groups of diseases associated with them. Cui et al. developed a tool called MISIM based on Wang's method [32] to measure the pairwise functional similarity of given miRNAs. MISIM can be downloaded from http://www.cuilab.cn/files/images/cuilab/misim.zip. Usually, the two disease groups are obtained from miRNA-disease associations. However, this leads to a problem: the miRNA functional similarity used to predict miRNA-disease associations is itself implied by the prediction target. That is to say, the miRNA functional similarity is inferred from miRNA-disease associations and disease semantic similarity, and the inferred results are then fed back into the process of predicting miRNA-disease associations.
To deal with this issue, we designed a new algorithm to obtain the functional similarity of miRNAs from their sequence data. The sequence of a miRNA determines its uniqueness and function, so our method can preserve the biological characteristics to the greatest extent.
We define the functional similarity of two miRNAs as $SM(m_1, m_2)$, and let $\mathrm{Levenshtein}(m_1, m_2)$ denote the edit distance between the two miRNA sequences. So, we have

$$SM(m_1, m_2) = 1 - \frac{\mathrm{Levenshtein}(m_1, m_2)}{\mathrm{len}(m_1) + \mathrm{len}(m_2)}$$

where $\mathrm{len}(m)$ is the length of the sequence of miRNA m. The miRNA functional similarity matrix is obtained by calculating the functional similarity between every pair of miRNAs. Suppose, for example, we have two miRNA sequences. One is hsa-mir-21 (CAACACCAGUCGAUGGGCUGU), the other is hsa-mir-155 (CUCCUACAUAUUAGCAUUAACA). Their Levenshtein distance is 19, and the functional similarity score is 1 − 19/(21 + 22) = 0.5581.
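The sequence-based similarity can be sketched in a few lines of Python. This sketch assumes the standard unit-cost edit distance (insertion, deletion and substitution each cost 1) and the normalization by the sum of the two sequence lengths used in the worked example. Whether the original computation weights substitutions differently is not stated, so values produced here may differ from the worked example. The function names are illustrative.

```python
def levenshtein(a, b):
    """Unit-cost edit distance between strings a and b (dynamic programming, two rows)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution or match
        prev = curr
    return prev[-1]


def sequence_similarity(seq1, seq2):
    """1 minus the edit distance divided by the sum of the two sequence lengths."""
    return 1.0 - levenshtein(seq1, seq2) / (len(seq1) + len(seq2))


def similarity_matrix(sequences):
    """Pairwise similarity matrix for a list of miRNA sequences."""
    return [[sequence_similarity(s, t) for t in sequences] for s in sequences]
```

Because the score depends only on the sequences, it can be computed without consulting any miRNA-disease association data.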

Data Fusion
The final miRNA similarity matrix (MS) and disease similarity matrix (DS) are obtained by integrating the miRNA functional similarity network, the disease semantic similarity network, and the experimentally confirmed miRNA-disease association network (R). After fusing these three datasets, 446 miRNAs, 322 diseases and 5,512 confirmed miRNA-disease associations are retained.

Loss Function
In this paper, the idea of matrix decomposition is used to solve the problem of miRNA-disease association prediction. Let MS represent the miRNA functional similarity network, DS represent the disease semantic similarity network, and R represent the miRNA-disease association network.
Firstly, for each miRNA and disease, we give an initial projection vector in a fixed k-dimensional space, and use the inner product of the vectors to represent the predicted association between them, collected in the matrix $R'$. Here M is an m × k matrix, where m is the number of miRNAs, and D is a k × d matrix, where d is the number of diseases. The goal is to minimize the distance between $R'$ and the real relationship R by solving for appropriate M and D. Because only the positive samples are credible, the distance in Formula (5) is evaluated only over the known associations. In addition, M and D are expected to match the prior MS and DS, so corresponding similarity terms are added to the loss function. Because of the terms $M_i \times M^{T}$ and $D_j \times D^{T}$, the loss function contains quadratic forms in the variables of interest. This prevents us from obtaining a simplified equation in those variables during the optimization, which makes it impossible to get the optimal solution in each iteration. So, matrices X and Y are introduced as auxiliary variables to help the optimization, and Formula (9) is transformed accordingly. Empirically, two-norm constraints are also added on M and D to prevent overfitting, which gives the final loss function.
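The equations of this section are not reproduced in the text above. As a hedged reconstruction only, based on the prose and on the parameter descriptions given with Algorithm 1, one loss consistent with those descriptions can be written as follows. The symbols $\Omega$ (the set of known associations), the use of Frobenius norms, and the exact placement of X and Y are assumptions, with $M \in \mathbb{R}^{m \times k}$ (rows $M_i$) and $D \in \mathbb{R}^{k \times d}$ (columns $D_j$):

```latex
\begin{aligned}
L ={}& \sum_{(i,j) \in \Omega} \left( R_{ij} - M_i D_j \right)^2
      + \lambda_1 \left\lVert MS - M X^{T} \right\rVert_F^2
      + \lambda_2 \left\lVert DS - D^{T} Y \right\rVert_F^2 \\
     & + \mu_1 \left\lVert M - X \right\rVert_F^2
      + \mu_2 \left\lVert D - Y \right\rVert_F^2
      + \lambda_0 \left( \lVert M \rVert_F^2 + \lVert D \rVert_F^2 \right),
\qquad \Omega = \{ (i,j) : R_{ij} = 1 \}.
\end{aligned}
```

Under this assumed form, replacing one factor of each similarity term with an auxiliary matrix makes the loss quadratic in each of M, D, X and Y separately, so each block has a closed-form least squares solution when the others are held fixed. The terms weighted by $\mu_1$ and $\mu_2$ keep X close to M and Y close to D.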

Optimization
For Formula (11), there are four variables in the loss function, so M and D cannot be solved for directly. Thus, we use an iterative least squares approach. At the same time, since only positive samples participate in the optimization, it is hard to optimize the function by whole-matrix calculation. To settle this problem, we solve for the hidden vector of each miRNA and each disease separately. The specific steps are as follows. Firstly, M is updated using the current D, X and Y: we take the derivative of the loss with respect to $M_i$, set $\partial L / \partial M_i = 0$, and solve for $M_i$. Similarly, D, X and Y are each solved in turn with the other parameters fixed. Thus, the optimal solution of M, D, X and Y for the current iteration is obtained. This process is iterated until it converges.
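The alternating scheme described above can be sketched with NumPy as follows. This is not the authors' released code. It assumes a loss made of a squared error over known associations only, similarity terms $\lambda_1 \lVert MS - M X^T \rVert^2$ and $\lambda_2 \lVert DS - D^T Y \rVert^2$, coupling terms $\mu_1 \lVert M - X \rVert^2$ and $\mu_2 \lVert D - Y \rVert^2$, and an L2 penalty $\lambda_0$ on M and D, with M of shape m × k and D of shape k × d. The function names and the default `k` are illustrative, while the default coefficients follow the hyperparameter values reported in this paper.

```python
import numpy as np


def lfemda_loss(R, MS, DS, M, D, X, Y, lam0, lam1, lam2, mu1, mu2):
    """Assumed loss: fit on known associations only, plus similarity, coupling and L2 terms."""
    fit = np.sum(((R - M @ D) ** 2)[R == 1])
    sim = lam1 * np.sum((MS - M @ X.T) ** 2) + lam2 * np.sum((DS - D.T @ Y) ** 2)
    aux = mu1 * np.sum((M - X) ** 2) + mu2 * np.sum((D - Y) ** 2)
    reg = lam0 * (np.sum(M ** 2) + np.sum(D ** 2))
    return fit + sim + aux + reg


def als_step(R, MS, DS, M, D, X, Y, lam0, lam1, lam2, mu1, mu2):
    """One round of exact least squares updates for M, D, X and Y in turn."""
    eye = np.eye(M.shape[1])
    M, D = M.copy(), D.copy()
    base_m = lam1 * (X.T @ X) + (mu1 + lam0) * eye
    for i in range(R.shape[0]):            # one latent vector per miRNA
        Dp = D[:, R[i] == 1]               # columns of diseases linked to miRNA i
        rhs = Dp.sum(axis=1) + lam1 * (X.T @ MS[i]) + mu1 * X[i]
        M[i] = np.linalg.solve(Dp @ Dp.T + base_m, rhs)
    base_d = lam2 * (Y @ Y.T) + (mu2 + lam0) * eye
    for j in range(R.shape[1]):            # one latent vector per disease
        Mp = M[R[:, j] == 1]               # rows of miRNAs linked to disease j
        rhs = Mp.sum(axis=0) + lam2 * (Y @ DS[j]) + mu2 * Y[:, j]
        D[:, j] = np.linalg.solve(Mp.T @ Mp + base_d, rhs)
    # Auxiliary variables: closed-form solutions given the new M and D.
    X = np.linalg.solve(lam1 * (M.T @ M) + mu1 * eye, lam1 * (M.T @ MS) + mu1 * M.T).T
    Y = np.linalg.solve(lam2 * (D @ D.T) + mu2 * eye, lam2 * (D @ DS) + mu2 * D)
    return M, D, X, Y


def fit(R, MS, DS, k=5, iters=50, eps=1e-6, seed=0,
        lam0=6.0, lam1=0.1, lam2=0.1, mu1=3.0, mu2=3.0):
    """Iterate als_step until the loss decrease falls below eps; return scores and loss history."""
    rng = np.random.default_rng(seed)
    m, d = R.shape
    M, D = rng.random((m, k)), rng.random((k, d))
    X, Y = M.copy(), D.copy()
    params = (lam0, lam1, lam2, mu1, mu2)
    losses = [lfemda_loss(R, MS, DS, M, D, X, Y, *params)]
    for _ in range(iters):
        M, D, X, Y = als_step(R, MS, DS, M, D, X, Y, *params)
        losses.append(lfemda_loss(R, MS, DS, M, D, X, Y, *params))
        if losses[-2] - losses[-1] < eps:
            break
    return M @ D, losses
```

Since every update exactly minimizes the assumed loss over one block of variables, the loss cannot increase from one step to the next, which gives a simple sanity check for an implementation of this kind.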

Prediction
We use the inner product of the calculated M and D to obtain a new correlation matrix $R' = M^{T}D$, where $R'(i, j)$ is the predicted association of the ith miRNA and the jth disease. In fact, the value of $R'(i, j)$ makes sense only when compared with other values in the matrix $R'$. The larger the value, the higher the possibility that an association exists, but the value is not itself a probability of association. The specific steps of the LFEMDA algorithm are shown in Algorithm 1. The code and data of LFEMDA are freely available at https://raw.githubusercontent.com/kavinche/fantastic-telegram/master/data_and_code_of_LFEMDA.rar.
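Because the predicted values are meaningful only relative to one another, a typical use is to rank the unknown miRNAs for a given disease. The sketch below is illustrative: `scores` stands for the predicted matrix as a list of rows (miRNAs) by columns (diseases), `known` is a set of index pairs for confirmed associations, and both names are assumptions rather than part of the released code.

```python
def rank_candidates(scores, known, disease, top_n=50):
    """Return indices of miRNAs without a known link to `disease`, highest score first."""
    candidates = [i for i in range(len(scores)) if (i, disease) not in known]
    candidates.sort(key=lambda i: scores[i][disease], reverse=True)
    return candidates[:top_n]


# Hypothetical toy scores for four miRNAs and one disease.
toy_scores = [[0.2], [0.9], [0.5], [0.7]]
toy_known = {(1, 0)}
print(rank_candidates(toy_scores, toy_known, disease=0, top_n=2))
```

Only the ordering of the scores is used, which matches the observation that a score is not a probability.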

Performance Evaluation
We adopted Leave-One-Out Cross-Validation (LOOCV) to evaluate the performance of our approach and other miRNA-disease association prediction methods. Each known miRNA-disease association is left out in turn as the test data, while all the other known associations are treated as training data. The unknown associations are regarded as candidates. After the association prediction, each pair of disease and miRNA gets a score; the larger the score, the more likely the association.
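The leave-one-out procedure can be sketched independently of the prediction model. In the sketch below, `fit_predict` is a hypothetical callable that takes a training association matrix (a list of rows) and returns a score matrix of the same shape. Ranking the held-out pair among all pairs that are unknown in the training matrix is one common reading of the protocol, not necessarily the exact one used in this paper.

```python
def loocv_ranks(R, fit_predict):
    """Hide each known association in turn and return the rank of its score
    among all pairs unknown in the training matrix (rank 1 is the best)."""
    m, d = len(R), len(R[0])
    ranks = []
    for i in range(m):
        for j in range(d):
            if R[i][j] != 1:
                continue
            train = [row[:] for row in R]
            train[i][j] = 0  # the held-out association becomes a candidate
            S = fit_predict(train)
            candidates = [S[a][b] for a in range(m) for b in range(d)
                          if train[a][b] == 0]
            ranks.append(1 + sum(1 for s in candidates if s > S[i][j]))
    return ranks
```

The model is refitted once per known association, so the cost grows linearly with the number of known associations.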
Given a predefined threshold, a held-out association whose score is larger than the threshold is counted as a correctly identified positive sample, and a candidate whose score falls below the threshold is counted as a correctly identified negative sample. Then, the TPR, FPR and Receiver Operating Characteristic (ROC) curve can be calculated. Finally, the area under the ROC curve (AUC) is selected to measure the performance of the prediction method.

Algorithm 1: LFEMDA, predicting miRNA-disease association by latent feature extraction with positive samples
Input:
    MS: m × m miRNA functional similarity matrix
    DS: d × d disease semantic similarity matrix
    R: the experimentally confirmed miRNA-disease association matrix
Parameters:
    k: hidden space dimension
    λ0: L2-norm regularization coefficient
    λ1: coefficient of the distance between MS and the inner product of the miRNA representation matrices in the hidden space
    λ2: coefficient of the distance between DS and the inner product of the disease representation matrices in the hidden space
    µ1: coefficient of the distance between the miRNA representation matrix in the hidden space and the auxiliary matrix X
    µ2: coefficient of the distance between the disease representation matrix in the hidden space and the auxiliary matrix Y
Output:
    R′: the predicted miRNA-disease association matrix
Initialize the vector matrices M, D, and the auxiliary matrices X, Y of miRNAs and diseases
∆ ← ∞, loss ← ∞
while ∆ > ε:
    update M, given current D, X and Y, using Formula (11)
    update D, given current M, X and Y, using Formula (12)
    calculate current X based on the new M
    calculate current Y based on the new D
    calculate loss_new using Formula (9)
    ∆ ← |loss − loss_new|
    loss ← loss_new
end while
R′ = M^T D

To illustrate the performance of LFEMDA, we compared it with the existing state-of-the-art methods: RWRMDA, CMFMDA, RLSMDA, PBMDA and EGBMMDA. Figure 1 shows the comparison results. The hyperparameters in the experiment are set as follows: λ0 = 6.0, λ1 = 0.1, λ2 = 0.1, µ1 = 3.0, µ2 = 3.0.
As is demonstrated in the result, LFEMDA has the highest prediction performance among the compared methods.
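For reference, the AUC used in this evaluation can be computed directly from the scores without tracing the ROC curve, because the area under the curve equals the fraction of positive-negative pairs in which the positive sample scores higher, with ties counted as one half. The sketch below uses this pairwise form; the function name and the toy scores are illustrative.

```python
def auc(pos_scores, neg_scores):
    """Area under the ROC curve via pairwise comparison of positive and negative scores."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Toy scores: three of the four positive-negative pairs are ordered correctly.
print(auc([0.9, 0.3], [0.5, 0.1]))
```

The pairwise form is quadratic in the number of samples, so a rank-based implementation is preferable for large candidate sets.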

Case Study
In the dataset described in the Data Fusion section, 21 diseases have more than 60 known associated miRNAs. Here, we regard them as high association diseases. LFEMDA was compared with RWRMDA, CMFMDA, RLSMDA, PBMDA and EGBMMDA by LOOCV. The AUC results are shown in Table 1. The average AUCs of LFEMDA, RWRMDA, CMFMDA, RLSMDA, PBMDA and EGBMMDA are 85.22%, 60.40%, 80.08%, 63.75%, 76.33% and 82.38%, respectively. LFEMDA shows the best performance among the compared methods on 14 high association diseases, and still obtains good results on the other 7.

The results on new diseases are shown in Table 2, where LFEMDA obtained satisfactory results. Overall, LFEMDA shows excellent results not only on high association diseases, but also on new diseases. EGBMMDA and PBMDA get the best results in two situations. The experimental results of CMFMDA are not satisfactory on six new diseases, i.e., Moyamoya Disease, Hypoxia-Ischemia Brain, Liver Diseases Alcoholic, Amyotrophic Lateral Sclerosis, Pemphigus Benign Familial and Neuroma Acoustic. RLSMDA performs well on new diseases but poorly on high association diseases.
To further examine the performance of LFEMDA, a case study on breast neoplasms was carried out to demonstrate its prediction ability. Here, we used LFEMDA to identify potential miRNAs related to breast neoplasms. The prediction results were validated against three miRNA-disease association databases: HMDD, dbDEMC2 [34] and miR2Disease [35]. The top 50 breast-neoplasms-related miRNAs are listed in Table 3. The HMDD and dbDEMC2 databases confirm that all 50 predicted miRNAs are associated with the disease. The miR2Disease database confirms 47 of the predicted miRNAs.

Control Experiment with Different miRNA Functional Similarity
In Section 2.3, we describe in detail why we designed a new method for computing miRNA functional similarity. The usual method for calculating miRNA functional similarity scores depends on miRNA-disease associations, and the scores are then used together with disease semantic similarity to predict those same associations. To keep association information from being hidden in the scores, we put forward a method that calculates miRNA functional similarity from miRNA sequences. We compare the approach under two different miRNA functional similarities: one obtained by our method, the other from MISIM. The result is displayed in Figure 2. The AUC of LFEMDA with our similarity is 92.43%, and that with the similarity from MISIM is 88.04%. This illustrates the effectiveness of our method.



Conclusions
In this paper, we present a miRNA-disease association prediction method using latent feature extraction with positive samples (LFEMDA). Leave-One-Out Cross-Validation (LOOCV) is used to evaluate the performance of LFEMDA and other methods. The experimental results show that our method outperforms the others, not only on the high-association disease data, but also on the new disease data. The case study on breast neoplasms further demonstrates the ability of our method to predict potential associations. In addition, the control experiment shows that our calculation of miRNA functional similarity is effective. Given these contributions, we believe that LFEMDA is helpful in providing potential candidates for subsequent research on the etiology and pathogenesis of complex diseases.