MDSCMF: Matrix Decomposition and Similarity-Constrained Matrix Factorization for miRNA–Disease Association Prediction

MicroRNAs (miRNAs) are small non-coding RNAs that are involved in a number of complex biological processes, and numerous studies have demonstrated that miRNAs are closely associated with many human diseases. In this study, we present a method based on matrix decomposition and similarity-constrained matrix factorization (MDSCMF) to predict potential miRNA–disease associations. First, we utilized a matrix decomposition (MD) algorithm to remove outliers from the miRNA–disease association matrix. Then, miRNA similarity was determined by utilizing similarity kernel fusion (SKF) to integrate miRNA functional similarity and Gaussian interaction profile (GIP) kernel similarity, and disease similarity was determined by utilizing SKF to integrate disease semantic similarity and GIP kernel similarity. Furthermore, we added L2 regularization terms and similarity constraint terms to non-negative matrix factorization to form a similarity-constrained matrix factorization (SCMF) algorithm, which was applied to make predictions. MDSCMF achieved AUC values of 0.9488, 0.9540, and 0.8672 based on fivefold cross-validation (5-CV), global leave-one-out cross-validation (global LOOCV), and local leave-one-out cross-validation (local LOOCV), respectively. Case studies on three common human diseases were also carried out to demonstrate the prediction ability of MDSCMF. All experimental results confirmed that MDSCMF was effective in predicting underlying associations between miRNAs and diseases.


Introduction
MiRNAs are 17-24 nt non-coding RNAs that play a pivotal role in controlling the expression of genes through RNA cleavage or translation repression [1][2][3]. Lin-4 was the first miRNA identified experimentally, by Lee et al. [4] in 1993. Since that time, a large number of miRNAs have been discovered experimentally [4][5][6]. Researchers have found that various miRNAs are involved in several crucial biological processes, such as cell development, cell differentiation, and cell proliferation [7][8][9][10]. Developmental defects can result from the dysregulation of miRNAs that are associated with the progression of diseases [11]. In the meantime, numerous studies have indicated that miRNAs are connected with a series of human neoplasms, including breast neoplasms, lung neoplasms, and prostate neoplasms [12][13][14]. Hence, identifying miRNAs associated with diseases can deepen the understanding of the genetic causes of complex diseases. Strong connections between miRNAs and diseases have been found by a variety of traditional experiments in the past few years [15,16]. Traditional experimental approaches can infer the associations between miRNAs and diseases, but they are time-consuming, laborious, and have a high failure rate. Therefore, revealing the potential relationships between miRNAs and diseases requires effective and stable computational methods, which can obtain increasingly reliable miRNA-disease associations. the network is very important. Researchers have proposed many design methods for the initial state of nodes in recent years, but the prediction performance has not been greatly improved.
As artificial intelligence technology has developed, machine-learning-based models have increasingly been employed for the accurate prediction of miRNA-disease relationships. To obtain accurate results in matrix completion for miRNA-disease association prediction, Li et al. [34] avoided using negative samples in MCMDA. To infer unknown miRNA-disease interactions, the probabilistic matrix factorization (PMF) algorithm was applied [35] to make predictions. The PMF algorithm is a machine learning technique commonly employed in recommender systems, and can effectively utilize all available data to recommend miRNAs linked to the disease in question. Ha et al. [36] utilized a matrix completion with network regularization method to recognize potential disease-related miRNAs. They considered an miRNA network as additional implicit feedback, and made predictions for disease associations with a given miRNA relying on its direct neighbors. Guo et al. [37] introduced MLPMDA, a novel model for predicting miRNA-disease associations using multilayer linear projection. They preprocessed the miRNA-disease interaction information using the top nearest neighbors of entities, and then used the updated miRNA-disease interactions and disease similarity to construct a heterogeneous matrix. In this heterogeneous matrix, the multilayer projection and layer-stacking strategy were employed to make predictions. However, in order to obtain dependable and steady performance, MLPMDA requires high-quality biological data. Ding et al. [38] presented a novel computational model named VGAMF for predicting miRNA-disease associations. VGAMF first integrated several different types of information about miRNAs and diseases into comprehensive similarity networks of miRNAs and diseases, which were used to extract the nonlinear representations of miRNAs and diseases based on variational graph autoencoders.
Then, VGAMF obtained the linear representations of miRNAs and diseases by implementing non-negative matrix factorization to process the miRNA-disease association matrix. Finally, a fully connected neural network combined linear representations with nonlinear representations to generate the predicted miRNA-disease association scores. Wang et al. [39] presented a novel method called NMCMDA to observe unknown disease-related miRNAs. The encoder and decoder were the two essential components in NMCMDA. The encoder was developed using a graph neural network to extract latent miRNA and disease characteristics from a heterogeneous miRNA-disease network. These latent features were used by the decoder to generate miRNA-disease association scores. For NMCMDA, a variety of encoders and decoders have been proposed. Finally, in NMCMDA, the combination of a relational graph convolutional network encoder and a neural multirelational decoder achieved the best prediction results. In summary, machine-learning-based models can produce more accurate prediction results, but most of them have difficulties in adjusting the optimal parameters and selecting negative samples, which seriously affect the training efficiency of the model.
Despite their good performance, the abovementioned prediction models have several limitations, such as inadequate measurement of similarity, excessive noise in experimental data, and inaccurate prediction results. To overcome these limitations, we present a novel model called MDSCMF, which combines matrix decomposition with similarity-constrained matrix factorization to predict unobserved miRNA-disease associations. To construct information-rich miRNA similarity and disease similarity, we applied SKF to integrate various kinds of miRNA similarity data and disease similarity data. In addition, because the unknown miRNA-disease associations were much more numerous than the known associations, an MD algorithm was used to remove outliers from the miRNA-disease association matrix. Furthermore, we added L2 regularization terms and similarity constraint terms to non-negative matrix factorization to form an SCMF algorithm, which was implemented to obtain the final association scores of each miRNA-disease pair. To evaluate the effectiveness of MDSCMF, 5-CV, global LOOCV, and local LOOCV were carried out on the known miRNA-disease association data downloaded from HMDD v3.2 [40]. Furthermore, we performed case studies on colon neoplasms, breast neoplasms, and lung neoplasms. As a result, 29, 29, and 28 out of the top 30 miRNAs potentially connected to these high-risk human diseases, respectively, were confirmed by miR2Disease [41] and dbDEMC v2.0 [42]. Experimental results showed that MDSCMF was effective for inferring possible relationships between miRNAs and diseases.

Performance Evaluation
In this section, based on the verified associations between miRNAs and diseases in the HMDD v3.2 database, 5-CV, global LOOCV, and local LOOCV were implemented to evaluate the prediction performance of MDSCMF.
In the framework of 5-CV, we compared MDSCMF with previous computational methods, including GCAEMDA [43], MSCHLMDA [44], NIMCGCN [45], and HFHLMDA [46]. The full set of verified miRNA-disease associations was randomly divided into five parts; each part served as the test set in turn, while the other four parts formed the training set. All unknown miRNA-disease associations were considered as candidate samples. We applied our method to determine the ranking of the test set relative to the candidate samples. Furthermore, to reduce the potential bias caused by random sample division, we repeated the random division 100 times. When the ranking of a test sample was higher than a given threshold, MDSCMF was regarded as having predicted that association successfully. We could then use the receiver operating characteristic (ROC) curve, obtained by plotting the true positive rate (TPR) against the false positive rate (FPR) at different thresholds, to evaluate the performance of MDSCMF. We calculated the areas under the ROC curve (AUCs) of these methods, whose values lie between 0 and 1. Figure 1 indicates that MDSCMF, GCAEMDA, MSCHLMDA, NIMCGCN, and HFHLMDA had AUC values of 0.9488, 0.9415, 0.9324, 0.9378, and 0.9301, respectively. The AUC value of MDSCMF was higher than those of the other methods.
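The ranking-based evaluation described above can be illustrated with a minimal Python sketch. It assumes that predicted scores are already available for the held-out test associations and for the candidate samples; the function names `roc_points` and `auc` and the toy scores are hypothetical and are not taken from the MDSCMF implementation.

```python
import numpy as np

def roc_points(test_scores, candidate_scores):
    """Sweep a score threshold and return (FPR, TPR) arrays.

    test_scores: predicted scores of held-out verified associations (positives).
    candidate_scores: predicted scores of unknown associations (treated as negatives).
    """
    test = np.asarray(test_scores, dtype=float)
    cand = np.asarray(candidate_scores, dtype=float)
    # Every distinct score is used as a threshold, from strictest to loosest.
    thresholds = np.unique(np.concatenate([test, cand]))[::-1]
    fpr, tpr = [0.0], [0.0]
    for t in thresholds:
        tpr.append(float(np.mean(test >= t)))
        fpr.append(float(np.mean(cand >= t)))
    return np.array(fpr), np.array(tpr)

def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

# Toy scores, invented for illustration only.
fpr, tpr = roc_points([0.9, 0.8, 0.3], [0.7, 0.4, 0.2, 0.1])
toy_auc = auc(fpr, tpr)
```

In a repeated 5-CV setting, one would presumably average the AUC over the five folds and over the 100 random divisions.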
In the framework of global LOOCV, MDSCMF was also compared with GCAEMDA, MSCHLMDA, NIMCGCN, and HFHLMDA. Each verified miRNA-disease association served as the test set in turn, while the training set was composed of the other verified associations. All unknown miRNA-disease associations were considered as candidate samples. In addition, we applied MDSCMF to obtain all predicted association scores so that the ranking of the test set relative to the candidate samples could be determined. Similar to 5-CV, we also calculated the AUCs of these methods to evaluate their performance. From Figure 2, we can see that MDSCMF, GCAEMDA, MSCHLMDA, NIMCGCN, and HFHLMDA had AUC values of 0.9540, 0.9505, 0.9378, 0.9410, and 0.9321, respectively. Hence, the AUC value of MDSCMF was also higher than those of the other methods.
In the framework of local LOOCV, we also compared MDSCMF with other previous models (i.e., RFMDA [47], BNPMDA [48], ABMDA [49], and VGAMF [38]) to objectively evaluate its performance. In this way, we could assess the ability of MDSCMF to predict associated miRNAs for diseases without any verified related miRNAs. For random diseases in the HMDD v3.2 database, the confirmed associations between each disease and all miRNAs were considered as the test set, and the remaining associations were regarded as the training set. Similar to the previous two cross-validation methods, the AUC value in local LOOCV served as the evaluation criterion. The specific results are shown in Figure 3, which shows that the prediction performance of MDSCMF was better than that of the other models.

Parameter Analysis
In this section, the parameters $\vartheta$ and $\sigma$ were quantitatively analyzed to study their effects on the prediction performance. $\vartheta$ and $\sigma$ are the regularization parameters that control the degree of overfitting and the smoothness of similarity consistency, respectively. We ran MDSCMF with all combinations of $\vartheta \in \{2^{-3}, 2^{-2}, \ldots, 2^{3}\}$ and $\sigma \in \{2^{-3}, 2^{-2}, \ldots, 2^{3}\}$. The AUC values of 5-CV were used to evaluate the performance of the model under the different parameter combinations. After these tests, we concluded that the model obtained the best performance when $\vartheta = 2^{2}$ and $\sigma = 2^{0}$, as shown in Figure 4.
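The exhaustive search over the two parameter grids can be sketched as follows. This is only an illustration: `evaluate` is a hypothetical callback that would run 5-CV for one parameter pair and return its AUC, and `toy_evaluate` is a stand-in with an arbitrary peak, not the real scoring procedure.

```python
import itertools

# Candidate values 2^-3, 2^-2, ..., 2^3 for both parameters (7 x 7 = 49 pairs).
GRID = [2.0 ** k for k in range(-3, 4)]

def grid_search(evaluate):
    """Return (best_score, theta, sigma) over all parameter pairs.

    `evaluate(theta, sigma)` is a hypothetical callback that should run 5-CV
    for one parameter pair and return the resulting AUC.
    """
    results = [(evaluate(theta, sigma), theta, sigma)
               for theta, sigma in itertools.product(GRID, GRID)]
    return max(results)

def toy_evaluate(theta, sigma):
    """Stand-in scorer with an arbitrary peak; it is not the real 5-CV AUC."""
    return -((theta - 0.5) ** 2 + (sigma - 2.0) ** 2)

best_score, best_theta, best_sigma = grid_search(toy_evaluate)
```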

Effects of Matrix Decomposition Analysis
In this section, we evaluated the effect on the model's performance of the pre-processing MD step applied to the known miRNA-disease association matrix $A$. The AUC values of 5-CV were used as indicators, and the corresponding ROC curves are shown in Figure 5. In MDSCMF, MD takes the sparsity of the miRNA-disease association matrix into account, thereby improving the prediction ability of the model. Conversely, MDSCMF without MD disregards the sparsity of the original association matrix; thus, the noise in the matrix may reduce the accuracy of the prediction. As shown in Figure 5, the AUC value of MDSCMF under the 5-CV framework was 0.9488, whereas the AUC value of MDSCMF without MD was 0.9291. The comparison shows that MDSCMF with MD has a higher AUC value than MDSCMF without MD.

Case Studies
To demonstrate the effectiveness and accuracy of MDSCMF, we carried out case studies on three human diseases (colon neoplasms, breast neoplasms, and lung neoplasms). Colon neoplasms are malignancies that have been confirmed to be associated with several miRNAs [50,51]. Breast neoplasms, which have been observed to be associated with several miRNAs in clinical experiments, have a high incidence rate among women [52]. Lung neoplasms are among the most dangerous malignancies, with the fastest increases in morbidity and mortality [53]. A growing body of evidence indicates that these diseases have close relationships with several miRNAs. For each disease, the candidate miRNAs were ranked according to their prediction scores. Moreover, we utilized two databases, miR2Disease [41] and dbDEMC v2.0 [42], to check the ranked miRNAs.
As a result, 29, 29, and 28 of the top 30 miRNAs inferred by our model were individually confirmed to be associated with colon neoplasms, breast neoplasms, and lung neoplasms, respectively, according to the miR2Disease [41] and dbDEMC v2.0 [42] databases. Tables 1-3 show the corresponding prediction results.

Materials and Methods
In this paper, we utilized the biological information of miRNAs and diseases to propose a novel method called MDSCMF, which combines the advantages of matrix decomposition and similarity-constrained matrix factorization to predict possible miRNA-disease associations. The flowchart of MDSCMF is shown in Figure 6.

Figure 6. Flowchart of MDSCMF.

Human miRNA-Disease Associations
In this study, we took advantage of miRNA-disease association data from the HMDD v3.2 database [40], which contained 12,446 verified associations between 853 miRNAs and 591 diseases. To make calculation more convenient, we constructed an adjacency matrix $A \in \mathbb{R}^{nm \times nd}$ to indicate the miRNA-disease relationships, where $nd$ and $nm$ stand for the numbers of diseases and miRNAs, respectively. Specifically, the element $A(i, j)$ is equal to 1 when miRNA $m_i$ has been verified to be associated with disease $d_j$, and is equal to 0 otherwise. Therefore, the matrix $A$ contains 12,446 entries that are equal to 1.
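As a small illustration of this construction, the following Python sketch builds a binary adjacency matrix from a list of (miRNA, disease) pairs. The helper `build_adjacency` and the placeholder names `m1`, `d1`, etc. are hypothetical; real identifiers would come from the HMDD v3.2 download, whose file format is not assumed here.

```python
import numpy as np

def build_adjacency(pairs, mirnas, diseases):
    """Return the nm x nd matrix A with A[i, j] = 1 for each verified pair."""
    mirna_index = {name: i for i, name in enumerate(mirnas)}
    disease_index = {name: j for j, name in enumerate(diseases)}
    A = np.zeros((len(mirnas), len(diseases)))
    for mirna, disease in pairs:
        A[mirna_index[mirna], disease_index[disease]] = 1.0
    return A

# Placeholder identifiers, invented for illustration only.
mirnas = ["m1", "m2", "m3"]
diseases = ["d1", "d2"]
pairs = [("m1", "d1"), ("m2", "d1"), ("m3", "d2")]
A = build_adjacency(pairs, mirnas, diseases)
```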

MiRNA Functional Similarity
The miRNAs with similar functions have a high probability of being related to diseases that are similar, and vice versa [20]. Therefore, we downloaded the miRNA functional similarity data from http://www.cuilab.cn/files/images/cuilab/misim.zip, accessed on 1 June 2022. For ease of calculation, we constructed the matrix $SM_1$ to store the data. The element $SM_1(m_i, m_j)$ represents the value of similarity between miRNA $m_i$ and miRNA $m_j$.

Disease Semantic Similarity
The directed acyclic graph (DAG) based on the MeSH descriptor [54] can be utilized to describe diseases. $DAG(D) = (D, T(D), E(D))$ represents the DAG of disease $D$, where $T(D)$ denotes the nodes in the DAG, comprising $D$ itself and its ancestor nodes, and $E(D)$ denotes the edges in the DAG that directly connect child nodes with parent nodes. The formula to calculate the semantic score of disease $D$ is defined as follows:

$$DV_1(D) = \sum_{d \in T(D)} D1_D(d) \tag{1}$$

where the formula to calculate the contribution value $D1_D(d)$ of disease $d$ is as follows:

$$D1_D(d) = \begin{cases} 1, & d = D \\ \max\{\Delta \cdot D1_D(d') \mid d' \in \text{children of } d\}, & d \neq D \end{cases} \tag{2}$$

where $\Delta$ is the semantic contribution factor, which was equal to 0.5 in our paper, based on previous literature [55]. The formula to obtain the semantic similarity score between disease $d_i$ and disease $d_j$ is defined as follows:

$$SS_1(d_i, d_j) = \frac{\sum_{t \in T(d_i) \cap T(d_j)} \left(D1_{d_i}(t) + D1_{d_j}(t)\right)}{DV_1(d_i) + DV_1(d_j)} \tag{3}$$

Furthermore, two diseases in the same layer of a DAG may appear in different numbers of DAGs, in which case it does not make sense to assign them the same semantic contribution. The semantic contribution of a disease that appears in many DAGs should be less than that of a disease that appears in few. Consequently, to further optimize the similarity information between diseases, another strategy was introduced to calculate disease semantic similarity following this method [56]. Specifically, the formulae to calculate the semantic score of disease $D$ and the contribution values of disease $d$ are as follows:

$$DV_2(D) = \sum_{d \in T(D)} D2_D(d) \tag{4}$$

$$D2_D(d) = -\log\left(\frac{\text{the number of DAGs including } d}{\text{the number of diseases}}\right) \tag{5}$$

Then, the formula to obtain the semantic similarity score between disease $d_i$ and disease $d_j$ is as follows:

$$SS_2(d_i, d_j) = \frac{\sum_{t \in T(d_i) \cap T(d_j)} \left(D2_{d_i}(t) + D2_{d_j}(t)\right)}{DV_2(d_i) + DV_2(d_j)} \tag{6}$$

To make the results more accurate, we treated the two kinds of semantic similarity as equally important. Therefore, if diseases $d_i$ and $d_j$ had semantic similarity, we calculated the average $SD_1(d_i, d_j)$ of $SS_1(d_i, d_j)$ and $SS_2(d_i, d_j)$ by the following formula:

$$SD_1(d_i, d_j) = \frac{SS_1(d_i, d_j) + SS_2(d_i, d_j)}{2} \tag{7}$$
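The first semantic similarity model can be sketched in Python as follows. This is a hedged illustration on a toy DAG: the `parents` dictionary, the node names, and the function names are hypothetical, and a real run would build the DAG from MeSH descriptors. Only the first model (decay factor $\Delta = 0.5$) is shown.

```python
def semantic_contributions(disease, parents, delta=0.5):
    """Contribution of every node in T(disease) to the semantics of `disease`.

    `parents` maps a node to the list of its direct parents in the DAG.
    The disease itself contributes 1; an ancestor contributes delta times the
    largest contribution among its children that lie in the DAG of `disease`.
    """
    contrib = {disease: 1.0}
    stack = [disease]
    while stack:
        node = stack.pop()
        for parent in parents.get(node, ()):
            value = delta * contrib[node]
            if value > contrib.get(parent, 0.0):
                contrib[parent] = value
                stack.append(parent)  # re-propagate the improved value upward
    return contrib

def semantic_similarity(d_i, d_j, parents, delta=0.5):
    """Shared contributions divided by the sum of both semantic scores."""
    c_i = semantic_contributions(d_i, parents, delta)
    c_j = semantic_contributions(d_j, parents, delta)
    shared = set(c_i) & set(c_j)
    numerator = sum(c_i[t] + c_j[t] for t in shared)
    return numerator / (sum(c_i.values()) + sum(c_j.values()))

# Toy DAG, invented for illustration: R is the root, A and B are its
# children, and C is a child of A.
toy_parents = {"A": ["R"], "B": ["R"], "C": ["A"]}
sim_ab = semantic_similarity("A", "B", toy_parents)
sim_ca = semantic_similarity("C", "A", toy_parents)
```

The second model would replace the decay-based contribution with the negative log frequency of each node across all disease DAGs, and the two scores would then be averaged.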

Gaussian Interaction Profile Kernel Similarity
The miRNAs with similar functions have a high probability of being related to similar diseases, and vice versa [20]. Therefore, the Gaussian interaction profile kernel similarity was applied to determine the miRNA similarity and disease similarity [57,58]. We defined the vector $K(d_i)$ to represent the interaction profile of disease $d_i$, according to whether or not $d_i$ had a verified association with each miRNA. Similarly, we defined the vector $K(m_i)$ to represent the interaction profile of miRNA $m_i$, according to whether or not $m_i$ had a verified association with each disease. The equation to calculate the GIP kernel similarity of diseases is defined as follows:

$$SD_2(d_i, d_j) = \exp\left(-\rho_d \left\|K(d_i) - K(d_j)\right\|^2\right) \tag{8}$$

where $\rho_d$ controls the kernel bandwidth. $\rho_d$ is obtained by normalizing the original bandwidth $\rho'_d$ by the average number of verified associations with miRNAs per disease, as follows:

$$\rho_d = \rho'_d \Big/ \left(\frac{1}{nd} \sum_{i=1}^{nd} \left\|K(d_i)\right\|^2\right) \tag{9}$$

Similarly, we used the following equations to calculate the GIP kernel similarity of miRNAs:

$$SM_2(m_i, m_j) = \exp\left(-\rho_m \left\|K(m_i) - K(m_j)\right\|^2\right) \tag{10}$$

$$\rho_m = \rho'_m \Big/ \left(\frac{1}{nm} \sum_{i=1}^{nm} \left\|K(m_i)\right\|^2\right) \tag{11}$$
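A minimal NumPy sketch of the GIP kernel is given below. It assumes that each row of the input holds one interaction profile and that the original bandwidth is 1, a common choice that the text does not state; the function name `gip_kernel` and the toy matrix are hypothetical.

```python
import numpy as np

def gip_kernel(profiles, rho_prime=1.0):
    """Gaussian interaction profile kernel between the rows of `profiles`.

    rho_prime is the original (unnormalized) bandwidth; 1.0 is an assumption.
    At least one profile must be non-zero, otherwise the normalization
    divides by zero.
    """
    P = np.asarray(profiles, dtype=float)
    sq_norms = np.sum(P ** 2, axis=1)
    # Normalize the bandwidth by the average squared norm of the profiles.
    rho = rho_prime / sq_norms.mean()
    # Pairwise squared Euclidean distances between the profiles.
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (P @ P.T)
    sq_dist = np.maximum(sq_dist, 0.0)  # guard against tiny negative values
    return np.exp(-rho * sq_dist)

# Toy association matrix (3 miRNAs x 2 diseases), invented for illustration.
A_toy = np.array([[1, 0], [1, 0], [0, 1]])
mirna_gip = gip_kernel(A_toy)      # rows of A are miRNA profiles
disease_gip = gip_kernel(A_toy.T)  # columns of A are disease profiles
```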

Integrating Similarity for miRNAs and Diseases
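In this section, the similarity kernel fusion [59] was implemented to integrate miRNA functional similarity and GIP kernel similarity into the ultimate miRNA similarity. The integration of the miRNA similarity matrices can be divided into the following major steps.

In the first step, the two different miRNA similarities are treated as original miRNA similarity kernels, which are defined as $SM_n$, $n = 1, 2$ in the above sections. Each miRNA similarity is normalized by the following equation:

$$F_n(m_i, m_j) = \frac{SM_n(m_i, m_j)}{\sum_{m_k \in M} SM_n(m_k, m_j)} \tag{12}$$

where $F_n(m_i, m_j)$ denotes the normalized kernel that satisfies $\sum_{m_k \in M} F_n(m_k, m_j) = 1$, and $M$ indicates the set of miRNAs.

In the second step, the neighbor-constraint kernel for each original miRNA kernel can be constructed as follows:

$$S_n(m_i, m_j) = \begin{cases} \dfrac{SM_n(m_i, m_j)}{\sum_{m_k \in N_i} SM_n(m_i, m_k)}, & m_j \in N_i \\ 0, & m_j \notin N_i \end{cases} \tag{13}$$

where $S_n(m_i, m_j)$ denotes a neighbor-constraint kernel that satisfies $\sum_{m_j \in M} S_n(m_i, m_j) = 1$, and $N_i$ denotes the collection of all neighbors of miRNA $m_i$, including itself.

In the third step, the normalized kernels and neighbor-constraint kernels are integrated as follows:

$$F_n^{l+1} = \tau \left(S_n \times \sum_{t \neq n} F_t^{l} \times S_n^{T}\right) + (1 - \tau) \sum_{t \neq n} F_t^{0} \tag{14}$$

where $F_n^{l+1}$ represents the value of the $n$-th kernel after $l + 1$ iterations, $F_t^{0}$ represents the initial value of $F_t$, and the weight parameter $\tau \in (0, 1)$ is used to balance the rate. With two kernels, each sum over $t \neq n$ contains a single term. After $F_n^{l+1}$, $n = 1, 2$ is obtained, the overall kernel $SM^*$ can be calculated by the following formula:

$$SM^* = \frac{1}{2} \sum_{n=1}^{2} F_n^{l+1} \tag{15}$$

In the fourth step, a weighted matrix $W$ is applied to further eliminate noise in the overall kernel $SM^*$. The construction process of $W$ is as follows:

$$W(i, j) = \begin{cases} 1, & m_i \in N_j \text{ and } m_j \in N_i \\ 0, & m_i \notin N_j \text{ and } m_j \notin N_i \\ 0.5, & \text{otherwise} \end{cases} \tag{16}$$

In the last step, the ultimate miRNA similarity kernel $SM \in \mathbb{R}^{nm \times nm}$ can be calculated by the following formula, where $\circ$ denotes the element-wise product:

$$SM = W \circ SM^* \tag{17}$$

In the same way, we could obtain the ultimate disease similarity kernel $SD \in \mathbb{R}^{nd \times nd}$.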
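The five steps of similarity kernel fusion can be sketched in NumPy as follows. This is an illustration under several assumptions that the text does not fix: the neighborhood size `k`, the weight `tau`, and the number of iterations are arbitrary; neighbors are taken as the `k` most similar items; and the weight matrix is built from neighborhoods in the fused kernel. The function names are hypothetical.

```python
import numpy as np

def neighbor_mask(K, k):
    """mask[i, j] is True when item j is among the k nearest neighbors of i.

    Each item counts as its own neighbor.
    """
    K = np.array(K, dtype=float)         # copy, so the caller's array is kept
    np.fill_diagonal(K, np.inf)          # make sure item i is selected first
    nearest = np.argsort(-K, axis=1)[:, :k]
    mask = np.zeros(K.shape, dtype=bool)
    mask[np.arange(K.shape[0])[:, None], nearest] = True
    return mask

def skf(kernels, k=3, tau=0.5, n_iter=10):
    """Fuse similarity kernels; assumes strictly positive kernel entries."""
    kernels = [np.asarray(K, dtype=float) for K in kernels]
    n_k = len(kernels)
    # Step 1: normalize every kernel so that each column sums to 1.
    F0 = [K / K.sum(axis=0, keepdims=True) for K in kernels]
    # Step 2: neighbor-constraint kernels, each row sums to 1.
    S = []
    for K in kernels:
        local = np.where(neighbor_mask(K, k), K, 0.0)
        S.append(local / local.sum(axis=1, keepdims=True))
    # Step 3: iterative fusion; every kernel is updated from the other ones.
    F = [f.copy() for f in F0]
    for _ in range(n_iter):
        F = [tau * S[n] @ (sum(F[t] for t in range(n_k) if t != n) / (n_k - 1)) @ S[n].T
             + (1.0 - tau) * sum(F0[t] for t in range(n_k) if t != n) / (n_k - 1)
             for n in range(n_k)]
    fused = sum(F) / n_k
    # Step 4: weight matrix (1 for mutual neighbors, 0.5 for one-sided, 0 else).
    mask = neighbor_mask(fused, k)
    W = np.where(mask & mask.T, 1.0, np.where(mask | mask.T, 0.5, 0.0))
    # Step 5: element-wise product gives the final similarity.
    return W * fused

# Toy symmetric kernels with unit diagonal, invented for illustration.
rng = np.random.default_rng(0)
B1, B2 = rng.random((6, 6)), rng.random((6, 6))
K1, K2 = (B1 + B1.T) / 2.0, (B2 + B2.T) / 2.0
np.fill_diagonal(K1, 1.0)
np.fill_diagonal(K2, 1.0)
SM_toy = skf([K1, K2])
```

Dividing by `n_k - 1` has no effect with two kernels, so the sketch agrees with the two-kernel case described above while also covering more than two inputs.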

Matrix Decomposition
From the published literature [60], we found that the data used in experiments were far from perfect. Some real miRNA-disease association data were redundant and/or missing. Therefore, we decomposed the adjacency matrix $A$ into two parts. The first part was the linear combination of the adjacency matrix $A$ and a low-rank matrix $Y$. The second part was the sparse matrix $X$, which included a large number of zero values. The data in the sparse matrix $X$ can be regarded as outliers. The matrix decomposition method was applied to acquire the lowest-rank matrix, which was employed to reconstruct a novel adjacency matrix. The formula to decompose the adjacency matrix $A$ is defined as follows:

$$A = AY + X \tag{18}$$

To make $Y$ low-rank, we imposed the nuclear norm on $Y$. In addition, the $L_{2,1}$ norm was imposed on $X$ so that $X$ became sparse. The specific process can be represented by the following formula:

$$\min_{Y, X} \; \|Y\|_* + \phi \|X\|_{2,1} \quad \text{s.t.} \quad A = AY + X \tag{19}$$

where $\|Y\|_* = \sum_i \beta_i$ (the $\beta_i$ are the singular values of $Y$) represents the nuclear norm of $Y$, $\|X\|_{2,1} = \sum_j \sqrt{\sum_i X_{ij}^2}$ represents the $L_{2,1}$ norm of $X$, and the positive free parameter $\phi$ balances the weights of $\|Y\|_*$ and $\|X\|_{2,1}$. Furthermore, minimizing the nuclear norm of $Y$ and the $L_{2,1}$ norm of $X$ keeps the calculation convenient. If the matrix $A$ that multiplies $Y$ is replaced by an identity matrix, the problem reduces to robust PCA. Therefore, Equation (19) can be treated as a generalization of robust PCA [61] and changed into an equivalent problem, as follows:

$$\min_{J, Y, X} \; \|J\|_* + \phi \|X\|_{2,1} \quad \text{s.t.} \quad A = AY + X, \; Y = J \tag{20}$$

Equation (20) is a constrained convex optimization problem. Therefore, both the first-order information and the special properties of such convex problems can be exploited to address scalability. The inexact augmented Lagrange multipliers (IALM) algorithm [62] can be utilized to convert Equation (20) into an unconstrained problem. Then, the augmented Lagrange function is minimized, as follows:

$$\mathcal{L} = \|J\|_* + \phi \|X\|_{2,1} + \mathrm{tr}\left(F_1^{T}(A - AY - X)\right) + \mathrm{tr}\left(F_2^{T}(Y - J)\right) + \frac{\alpha}{2}\left(\|A - AY - X\|_F^2 + \|Y - J\|_F^2\right) \tag{21}$$

where the penalty parameter $\alpha \geq 0$.
Equation (21) can be solved effectively by minimizing it with respect to $J$, $Y$, and $X$ in turn, with the other variables fixed, and then updating the Lagrange multipliers $F_1$ and $F_2$. The specific steps for solving Equation (21) are displayed in Algorithm 1. For example, step 3 of Algorithm 1 fixes the other variables and updates $X$ by $X = \arg\min_X \; \phi \|X\|_{2,1} + \frac{\alpha}{2}\left\|X - \left(A - AY + F_1/\alpha\right)\right\|_F^2$.
We defined the solution of Equation (21) as $Y^*$ and $X^*$. Because $A(i, j)$ represents the association between miRNA $m_i$ and disease $d_j$, $Y^* \in \mathbb{R}^{nd \times nd}$ could be applied to represent the similarity between diseases. Once $Y^*$ was obtained, the adjacency matrix $A^*$, which denotes the new associations between miRNAs and diseases, could be calculated by the following equation:

$$A^* = AY^* \tag{22}$$
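The alternating minimization can be illustrated with the following NumPy sketch of an inexact augmented Lagrange multiplier loop for the low-rank plus column-sparse decomposition. It follows the standard scheme for this kind of problem and is not taken from Algorithm 1 itself; the default values of `phi`, `alpha`, `rate`, and `tol`, as well as all function names, are assumptions made for illustration.

```python
import numpy as np

def singular_value_threshold(M, thr):
    """Proximal operator of the nuclear norm: shrink singular values by thr."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - thr, 0.0)) @ Vt

def column_shrink(Q, thr):
    """Proximal operator of the L2,1 norm: shrink the norm of every column."""
    norms = np.linalg.norm(Q, axis=0)
    scale = np.maximum(norms - thr, 0.0) / np.maximum(norms, 1e-12)
    return Q * scale

def md_ialm(A, phi=0.1, alpha=1e-4, alpha_max=1e10, rate=1.1,
            tol=1e-6, max_iter=1000):
    """Decompose A into A @ Y + X with Y low-rank and X column-sparse."""
    A = np.asarray(A, dtype=float)
    nm, nd = A.shape
    Y = np.zeros((nd, nd))
    X = np.zeros((nm, nd))
    F1 = np.zeros((nm, nd))   # multiplier for the constraint A = AY + X
    F2 = np.zeros((nd, nd))   # multiplier for the constraint Y = J
    inv = np.linalg.inv(np.eye(nd) + A.T @ A)
    for _ in range(max_iter):
        # Update J with Y, X fixed (singular value thresholding).
        J = singular_value_threshold(Y + F2 / alpha, 1.0 / alpha)
        # Update Y with J, X fixed (closed-form least squares).
        Y = inv @ (A.T @ (A - X) + J + (A.T @ F1 - F2) / alpha)
        # Update X with J, Y fixed (column-wise shrinkage).
        X = column_shrink(A - A @ Y + F1 / alpha, phi / alpha)
        # Update the multipliers and increase the penalty parameter.
        R1 = A - A @ Y - X
        R2 = Y - J
        F1 = F1 + alpha * R1
        F2 = F2 + alpha * R2
        alpha = min(rate * alpha, alpha_max)
        if max(np.abs(R1).max(), np.abs(R2).max()) < tol:
            break
    return Y, X

# Toy binary matrix (8 miRNAs x 5 diseases), invented for illustration.
rng = np.random.default_rng(1)
A_toy = (rng.random((8, 5)) < 0.3).astype(float)
Y_star, X_star = md_ialm(A_toy)
A_star = A_toy @ Y_star   # reconstructed association matrix
```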

Similarity-Constrained Matrix Factorization
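In this section, L2 regularization terms and similarity constraint terms were added to a traditional non-negative matrix factorization algorithm to form similarity-constrained matrix factorization, which was applied to uncover more unknown miRNA-disease interactions. The matrix $A^* \in \mathbb{R}^{nm \times nd}$ can be factorized into $U \in \mathbb{R}^{nm \times \gamma}$ and $V \in \mathbb{R}^{nd \times \gamma}$, where $\gamma$ represents the dimension of the miRNA features and disease features. Concretely, the miRNA-disease association can be regarded as the inner product between the miRNA feature vector and the disease feature vector: $a^*_{ij} \approx u_i v_j^{T}$, where $a^*_{ij}$ indicates the $(i, j)$th element of matrix $A^*$, while $u_i$ and $v_j$ indicate the $i$th row of $U$ and the $j$th row of $V$, respectively. The corresponding objective function is defined as follows:

$$\min_{U, V} \; \frac{1}{2} \sum_{i, j} \left(a^*_{ij} - u_i v_j^{T}\right)^2 \tag{23}$$

In what follows, the L2 regularization terms of $u_i$ and $v_j$ are added to the above function to prevent overfitting:

$$\min_{U, V} \; \frac{1}{2} \sum_{i, j} \left(a^*_{ij} - u_i v_j^{T}\right)^2 + \frac{\vartheta}{2} \left(\sum_i \|u_i\|^2 + \sum_j \|v_j\|^2\right) \tag{24}$$

where $\vartheta$ denotes the regularization parameter for controlling the balance.
When data points are mapped from a high-dimensional space into a low-dimensional space, the geometric properties of the data points will most likely stay the same [63,64]. Because the miRNA similarity $SM$ and disease similarity $SD$ can represent the geometric structure of the data points, the similarity constraint terms $S_U$ and $S_V$ are proposed as follows:

$$S_U = \frac{1}{2} \sum_{i, j} SM_{ij} \|u_i - u_j\|^2 \tag{25}$$

$$S_V = \frac{1}{2} \sum_{i, j} SD_{ij} \|v_i - v_j\|^2 \tag{26}$$

where $SM_{ij}$ represents the similarity between miRNAs $m_i$ and $m_j$, while $SD_{ij}$ denotes the similarity between diseases $d_i$ and $d_j$. Because each squared distance is weighted by the corresponding similarity, $S_U$ incurs a heavy penalty if two highly similar miRNAs $m_i$ and $m_j$ are mapped far apart in the miRNA feature space. Thus, we minimized $S_U$ to preserve the geometric structure of the miRNA data points, so that similar miRNAs $m_i$ and $m_j$ are mapped close to each other in the low-dimensional space. The same is true for the disease data points, so we also minimized $S_V$. Based on the above analysis, the objective function of SCMF can be defined by adding $S_U$ and $S_V$ to Equation (24) as follows:

$$\min_{U, V} \; L = \frac{1}{2} \sum_{i, j} \left(a^*_{ij} - u_i v_j^{T}\right)^2 + \frac{\vartheta}{2} \left(\sum_i \|u_i\|^2 + \sum_j \|v_j\|^2\right) + \frac{\sigma}{2} \left(S_U + S_V\right) \tag{27}$$

where $\sigma$ denotes the hyperparameter that controls the smoothness degree of similarity consistency. Subsequently, an efficient optimization algorithm is used to minimize the objective function of SCMF. First, using the symmetry of $SM$ and $SD$, the partial derivatives of $L$ with respect to $u_i$ and $v_j$ can be calculated by the following formulae:

$$\frac{\partial L}{\partial u_i} = u_i \left(V^{T}V + \vartheta I + \sigma \sum_{k \neq i} SM_{ik} I\right) - A^*(i, :)V - \sigma \sum_{k \neq i} SM_{ik} u_k \tag{28}$$

$$\frac{\partial L}{\partial v_j} = v_j \left(U^{T}U + \vartheta I + \sigma \sum_{k \neq j} SD_{jk} I\right) - A^*(:, j)^{T}U - \sigma \sum_{k \neq j} SD_{jk} v_k \tag{29}$$

where $A^*(i, :)$ and $A^*(:, j)$ indicate the $i$th row and $j$th column of matrix $A^*$, respectively. Next, the second derivatives of $L$ with respect to $u_i$ and $v_j$ are as follows:

$$\frac{\partial^2 L}{\partial u_i^2} = V^{T}V + \vartheta I + \sigma \sum_{k \neq i} SM_{ik} I \tag{30}$$

$$\frac{\partial^2 L}{\partial v_j^2} = U^{T}U + \vartheta I + \sigma \sum_{k \neq j} SD_{jk} I \tag{31}$$

Then, $u_i$ and $v_j$ can be iteratively updated according to Newton's method, as follows:

$$u_i \leftarrow u_i - \frac{\partial L}{\partial u_i} \left(\frac{\partial^2 L}{\partial u_i^2}\right)^{-1} \tag{32}$$

$$v_j \leftarrow v_j - \frac{\partial L}{\partial v_j} \left(\frac{\partial^2 L}{\partial v_j^2}\right)^{-1} \tag{33}$$

More specifically, substituting Equations (28)-(31) into Equations (32) and (33) gives the following updates of $u_i$ and $v_j$:

$$u_i = \left(A^*(i, :)V + \sigma \sum_{k \neq i} SM_{ik} u_k\right) \left(V^{T}V + \vartheta I + \sigma \sum_{k \neq i} SM_{ik} I\right)^{-1} \tag{34}$$

$$v_j = \left(A^*(:, j)^{T}U + \sigma \sum_{k \neq j} SD_{jk} v_k\right) \left(U^{T}U + \vartheta I + \sigma \sum_{k \neq j} SD_{jk} I\right)^{-1} \tag{35}$$

The update of $u_i$ and $v_j$ stops when the convergence condition is satisfied.
After that, the predicted association matrix can be calculated by the following formula:

$$\hat{A} = UV^{T} \tag{36}$$

The value of $\hat{A}_{ij}$ denotes the predicted association score between miRNA $m_i$ and disease $d_j$. The higher the prediction score, the greater the association probability.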
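The row-wise updates can be sketched in NumPy as follows. This is an illustration rather than the reference implementation: the latent dimension `gamma`, the random initialization, and the fixed number of sweeps are assumptions, the similarity matrices are assumed to be symmetric, and self-similarity terms are dropped because they cancel in the penalty. The helper `scmf_objective` is only there to monitor the objective value.

```python
import numpy as np

def scmf_objective(A_star, U, V, SM, SD, theta, sigma):
    """Fit term + L2 regularization + similarity constraint terms."""
    def smoothness(S, Z):
        sq = np.sum(Z ** 2, axis=1)
        dist = sq[:, None] + sq[None, :] - 2.0 * (Z @ Z.T)
        return 0.5 * np.sum(S * dist)
    fit = 0.5 * np.sum((A_star - U @ V.T) ** 2)
    reg = 0.5 * theta * (np.sum(U ** 2) + np.sum(V ** 2))
    return fit + reg + 0.5 * sigma * (smoothness(SM, U) + smoothness(SD, V))

def scmf(A_star, SM, SD, gamma=4, theta=4.0, sigma=1.0, n_sweeps=20, seed=0):
    """Alternate the closed-form row updates of U and V for n_sweeps passes."""
    rng = np.random.default_rng(seed)
    nm, nd = A_star.shape
    U = rng.random((nm, gamma))
    V = rng.random((nd, gamma))
    SM0, SD0 = np.array(SM, dtype=float), np.array(SD, dtype=float)
    np.fill_diagonal(SM0, 0.0)   # the k == i term vanishes in the penalty
    np.fill_diagonal(SD0, 0.0)
    I = np.eye(gamma)
    for _ in range(n_sweeps):
        VtV = V.T @ V
        for i in range(nm):
            H = VtV + (theta + sigma * SM0[i].sum()) * I
            b = A_star[i] @ V + sigma * (SM0[i] @ U)
            U[i] = np.linalg.solve(H, b)   # H is symmetric, so this is b H^-1
        UtU = U.T @ U
        for j in range(nd):
            H = UtU + (theta + sigma * SD0[j].sum()) * I
            b = A_star[:, j] @ U + sigma * (SD0[j] @ V)
            V[j] = np.linalg.solve(H, b)
    return U, V

# Toy inputs, invented for illustration only.
rng = np.random.default_rng(2)
A_toy = (rng.random((6, 4)) < 0.4).astype(float)
Bm, Bd = rng.random((6, 6)), rng.random((4, 4))
SM_toy, SD_toy = (Bm + Bm.T) / 2.0, (Bd + Bd.T) / 2.0
U, V = scmf(A_toy, SM_toy, SD_toy)
scores = U @ V.T   # predicted association scores
```

Because the objective is quadratic in each row, every update is an exact minimization over that row with the others held fixed, so the objective value should not increase from one sweep to the next. The defaults `theta=4.0` and `sigma=1.0` mirror the best values reported in the parameter analysis. Note that these updates do not enforce non-negativity of $U$ and $V$.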

Discussion
To address the inadequate measurement of similarity, excessive noise in experimental data, and inaccurate prediction results of previous prediction models, we developed a computational model for predicting miRNA-disease associations based on matrix decomposition and similarity-constrained matrix factorization (MDSCMF). Because the miRNA-disease association matrix was sparse, we applied the MD algorithm to reconstruct it. Our results demonstrated that the MD algorithm could improve the prediction performance to some extent. In addition, we applied SKF to integrate various types of similarities for constructing information-rich miRNA similarity and disease similarity. Furthermore, L2 regularization terms and similarity constraint terms were added to non-negative matrix factorization to form the SCMF algorithm, which was utilized to generate association scores for each miRNA-disease pair. In the frameworks of 5-CV, global LOOCV, and local LOOCV, MDSCMF achieved AUCs of 0.9488, 0.9540, and 0.8672, respectively, which were higher than those of the compared methods. Furthermore, most of the top-ranked miRNAs predicted for colon neoplasms, breast neoplasms, and lung neoplasms were confirmed by the miR2Disease and dbDEMC v2.0 databases, which supports the reliability of the prediction results generated by our method.
It should be noted that the following factors may contribute to the reliable performance of MDSCMF. First, the MD algorithm, which greatly alleviated the influence of the inherent noise in the current dataset, was utilized to refine the miRNA-disease association matrix. In addition, when we used SCMF to make predictions, the L2 regularization terms helped to avoid overfitting, while the similarity constraint terms improved robustness by drawing on the rich similarity data.
However, several limitations may influence the performance of MDSCMF. First, although the amount of available data has increased, the experimental data still need to be expanded. Furthermore, the data we utilized included miRNA functional similarity data and disease semantic similarity data, which may contain noise and outliers. Therefore, we should continue to optimize our model to improve its performance in the future.