SMMDA: Predicting miRNA-Disease Associations by Incorporating Multiple Similarity Profiles and a Novel Disease Representation

Simple Summary: Predicting possible associations between miRNAs and diseases would provide new perspectives on disease diagnosis, pathogenesis, and gene therapy. In this work, considering the limited accessibility, long duration, and high cost of traditional biological research, we present a novel computational method called SMMDA, which incorporates multiple similarity profiles and a novel disease representation to accelerate the identification of potential miRNA-disease associations. SMMDA is intended to support the prediction of associations between miRNAs and diseases and thereby to aid the prevention, diagnosis, treatment, and prognosis of human diseases.

Abstract: Increasing evidence has suggested that microRNAs (miRNAs) are significant in research on human diseases. Predicting possible associations between miRNAs and diseases would provide new perspectives on disease diagnosis, pathogenesis, and gene therapy. However, because traditional in vitro studies are intrinsically time-consuming and expensive, there is an urgent need for a computational approach that allows researchers to identify potential associations between miRNAs and diseases for further research. In this paper, we present a novel computational method called SMMDA to predict potential miRNA-disease associations. First, SMMDA utilizes a new disease representation method (MeSHHeading2vec) based on a network embedding algorithm and fuses it with the Gaussian interaction profile kernel similarity of miRNAs and diseases, disease semantic similarity, and miRNA functional similarity. Second, SMMDA utilizes a deep auto-encoder network to transform the original features further and achieve a better feature representation. Finally, the ensemble learning model XGBoost is used as the underlying training and prediction method for SMMDA. In the results, SMMDA achieved a mean accuracy of 86.68% with a standard deviation of 0.42% and a mean AUC of 94.07% with a standard deviation of 0.23%, outperforming many previous works. Moreover, we compared the predictive ability of SMMDA with different classifiers and different feature descriptors. In the case studies of three common human diseases, 47 (esophageal neoplasms), 48 (breast neoplasms), and 48 (colon neoplasms) of the top 50 candidate miRNAs were successfully verified by two other databases. The experimental results indicate that SMMDA has a reliable ability to predict potential miRNA-disease associations. Therefore, it is anticipated that SMMDA could be an effective tool for biomedical researchers.


Introduction
MicroRNAs (miRNAs) constitute a group of noncoding RNAs about 22 nucleotides long, prevalent in plants and animals [1]. They act as essential regulators of gene expression, participating in mRNA degradation or post-transcriptional repression by binding complementarily to the 3′ untranslated regions of their target mRNAs [2].
By targeting multiple transcripts, miRNAs play pivotal roles in biological processes such as cell development [3][4][5], apoptosis [6], and metabolism [7]. Recently, a growing body of research has revealed the value of miRNAs as prognostic biomarkers and as important diagnostic and promising therapeutic targets in the treatment of malignant tumors [8]. For example, the expression of hsa-miR-17-3p is altered in lung cancer from smokers, and the methylation levels of hsa-miR-124-2 were reduced in SiHa cells [9]. The critical role of miRNAs in humans has attracted the attention of many researchers, and traditional in vitro experimental methods have been used to investigate the associations between miRNAs and human diseases, yielding many significant results. However, in vitro experiments are costly in both labor and money and are not suited to studying large-scale miRNA and disease data. In recent years, machine learning, deep learning, and related methods have been increasingly applied to bioinformatics problems. Accordingly, more and more researchers are using such methods to study associations between miRNAs and human diseases.
Based on the hypothesis that interacting miRNA-disease pairs are more functionally similar and tend to be associated with the same miRNAs or diseases [10][11][12], computational models for predicting miRNA-disease associations have emerged in recent years. For example, Chen et al. [13] developed a heterogeneous label propagation method (HLPMDA) by propagating a heterogeneous label in the multiple networks of miRNAs, diseases, and lncRNAs to predict miRNA-disease associations. Ji et al. [10] focused on constructing a human biological association network using the association between miRNAs and diseases, and other biomolecules in the human body for predicting potential associations between miRNAs and diseases. In addition, this work also introduces graph representation learning methods and deep stacked autoencoder methods to obtain excellent prediction performance. Chen et al. [14] invented a bipartite network projection method (BNPMDA) by fusing integrated miRNA and disease similarity to predict miRNA-disease associations. In this work, a bipartite network recommendation method was applied to predict the potential associations between miRNAs and diseases.
In addition, machine learning approaches have been widely investigated in bioinformatics for predicting potential associations between miRNAs and diseases [15]. For example, Ji et al. [16] designed an attribute network embedding approach that considers both attribute features and network features, and used a typical ensemble learning method, random forest, to predict potential associations between miRNAs and human diseases. Zheng et al. [34] utilized a deep auto-encoder neural network (AE) and a random forest classifier to predict potential miRNA-disease associations (MLMDA). Xu et al. [17] proposed a novel method based on the miRNA target-dysregulated network (MTDN); based on the changes and features in miRNA expression, they used an SVM classifier for prediction. Zhang et al. [18] utilized a variational auto-encoder approach for miRNA-disease association prediction, called VAEMDA. They constructed two spliced matrices by combining the integrated miRNA similarity and the integrated disease similarity, respectively, with the known miRNA-disease associations. This method avoids the noise introduced by the random selection of negative instances and models miRNA-disease associations from the viewpoint of data distribution.
In this work, we present a novel computational method called SMMDA, which incorporates multiple similarity profiles and a novel disease representation to accelerate the identification of potential miRNA-disease associations. The flowchart of SMMDA for predicting potential miRNA-disease associations is shown in Figure 1. In summary, the main contributions of this paper are as follows.
Considering the limited accessibility, high time consumption, and high cost of traditional biological research, a novel computational model called SMMDA was proposed to accelerate the identification of potential associations between miRNAs and diseases.
The multiple similarity profiles of miRNAs and diseases and a novel disease representative feature were incorporated to predict potential miRNA-disease associations, enhancing predictive accuracy.
Deep learning is used for high-quality extraction of integrated features, and the gradient boosting method is used for fast and highly accurate training and prediction.
Compared with previous related works, the experiment results have proved the superior performance of SMMDA for predicting potential miRNA-disease associations.

Human miRNA-Disease Associations
The HMDD v3.0 database (Human MicroRNA Disease Database) [19] contains 32,281 associations between 1102 miRNAs and 850 diseases, collected from 17,412 papers. In our experiments, the positive dataset contains 32,226 associations between 1057 miRNAs and 850 diseases; the associations removed were those involving data considered unreliable by the public database miRBase. In addition, we randomly selected 32,226 miRNA-disease pairs with no known association as the negative dataset; none of these pairs appears in the positive dataset.
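The negative sampling step described above can be illustrated with a minimal sketch. It assumes that miRNAs and diseases are indexed by integers and that negatives are drawn uniformly by rejection sampling; the function name `sample_negative_pairs`, the seed, and the toy data are hypothetical and not taken from the original work.

```python
import random

def sample_negative_pairs(n_mirnas, n_diseases, positives, n_samples, seed=0):
    """Draw n_samples distinct (miRNA, disease) index pairs absent from positives."""
    positives = set(positives)
    max_negatives = n_mirnas * n_diseases - len(positives)
    if n_samples > max_negatives:
        raise ValueError("not enough unlabeled pairs to sample from")
    rng = random.Random(seed)
    negatives = set()
    # Rejection sampling: keep drawing random pairs, skip known associations.
    while len(negatives) < n_samples:
        pair = (rng.randrange(n_mirnas), rng.randrange(n_diseases))
        if pair not in positives:
            negatives.add(pair)
    return sorted(negatives)

# Toy example (hypothetical data): 4 miRNAs, 3 diseases, 5 known associations.
known = [(0, 0), (0, 2), (1, 1), (2, 0), (3, 2)]
negatives = sample_negative_pairs(4, 3, known, n_samples=len(known))
```

Drawing as many negatives as positives keeps the two classes balanced, as in the dataset described above. Note that unlabeled pairs are only presumed negative; some may be true associations that have not yet been reported.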

miRNA Functional Similarity
Functional similarity between miRNAs is a critical feature for miRNA-disease association prediction and is derived from the calculations of Wang et al. [20]. They constructed a miRNA functional similarity score matrix (MF), available at http://www.cuilab.cn/files/images/cuilab/misim.zip (accessed on 1 March 2022), based on the principle that miRNAs with similar functions are more likely to be associated with diseases with similar phenotypes. The similarity score between miRNA $m_1$ and miRNA $m_2$ is expressed as $MF(m_1, m_2)$.

Gaussian Interaction Profile Kernel Similarity
Since miRNAs with similar functions are more likely to be associated with diseases with similar phenotypes, and vice versa, we further calculated the Gaussian interaction profile (GIP) kernel similarity for miRNAs and diseases [21]. First, a 1057 × 850 adjacency matrix MD was constructed, with rows corresponding to miRNAs and columns corresponding to diseases. The element $MD(m_i, d_j)$ equals 1 if an association between miRNA $m_i$ and disease $d_j$ is recorded in the HMDD database, and 0 otherwise. The $i$-th row vector of MD, denoted $MD(m_i)$, is a binary vector representing the interaction profile of miRNA $m_i$. Based on this definition, the GIP kernel similarity between miRNAs $m_i$ and $m_j$, $GM(m_i, m_j)$, is defined as follows:

$$GM(m_i, m_j) = \exp\left(-\delta_m \left\lVert MD(m_i) - MD(m_j) \right\rVert^2\right)$$

where $\delta_m$, the kernel bandwidth, is obtained by normalizing an original parameter $\delta'_m$, as shown below:

$$\delta_m = \delta'_m \Big/ \left(\frac{1}{m} \sum_{i=1}^{m} \left\lVert MD(m_i) \right\rVert^2\right)$$

where $m$ denotes the number of rows of MD.
In the same way, the GIP kernel similarity $GD(d_i, d_j)$ between diseases $d_i$ and $d_j$ is defined as follows:

$$GD(d_i, d_j) = \exp\left(-\delta_d \left\lVert MD(d_i) - MD(d_j) \right\rVert^2\right)$$

$$\delta_d = \delta'_d \Big/ \left(\frac{1}{d} \sum_{i=1}^{d} \left\lVert MD(d_i) \right\rVert^2\right)$$

where $d$ denotes the total number of columns of MD and $MD(d_i)$ denotes its $i$-th column vector.
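As an illustration of the GIP kernel similarity, the following NumPy sketch computes the kernel for the rows of a small adjacency matrix. The function name `gip_kernel`, the default original bandwidth parameter of 1.0, and the toy matrix are assumptions made for this example, not values stated in the text.

```python
import numpy as np

def gip_kernel(profiles, bandwidth=1.0):
    """Gaussian interaction profile kernel between the rows of `profiles`."""
    profiles = np.asarray(profiles, dtype=float)
    sq_norms = (profiles ** 2).sum(axis=1)
    # Normalize the bandwidth by the mean squared norm of the profiles.
    delta = bandwidth / sq_norms.mean()
    # Squared Euclidean distance between every pair of rows.
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * profiles @ profiles.T
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against tiny negative round-off
    return np.exp(-delta * sq_dists)

# Toy adjacency matrix (hypothetical): 4 miRNAs (rows) x 3 diseases (columns).
MD = np.array([[1, 0, 1],
               [1, 0, 0],
               [0, 1, 0],
               [0, 1, 1]])
GM = gip_kernel(MD)    # miRNA-miRNA similarity from row profiles
GD = gip_kernel(MD.T)  # disease-disease similarity from column profiles
```

Transposing the adjacency matrix reuses the same function for diseases, since the disease kernel is defined on column profiles in exactly the same way.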

Disease Semantic Similarity
The U.S. National Library of Medicine classifies all human diseases and has constructed the Medical Subject Headings (MeSH) database. Following this classification, each disease can be represented by a directed acyclic graph (DAG). For example, a disease D can be represented as DAG(D) = (D, T(D), E(D)), where T(D) denotes node D and all its ancestor nodes, and E(D) denotes the set of edges associated with node D. Further, we defined the contribution of node d in DAG(D) to the semantic value of disease D as:

$$D_D(d) = \begin{cases} 1, & d = D \\ \max\left\{\Delta \cdot D_D(d') \mid d' \in \text{children of } d\right\}, & d \neq D \end{cases}$$

where ∆ is the semantic contribution factor [20,22]. The semantic value of disease D is the sum of the contributions of all nodes in T(D):

$$DV(D) = \sum_{d \in T(D)} D_D(d)$$

It follows that the larger the part of their DAGs that two diseases share, the higher their similarity score. Therefore, the semantic similarity score between diseases $d_i$ and $d_j$ is:

$$DS(d_i, d_j) = \frac{\sum_{t \in T(d_i) \cap T(d_j)} \left(D_{d_i}(t) + D_{d_j}(t)\right)}{DV(d_i) + DV(d_j)}$$
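The DAG-based semantic similarity can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the DAG is given as a child-to-parents mapping, the semantic contribution factor is set to 0.5 (a value not given in the text), and the node names are hypothetical.

```python
def semantic_contributions(disease, parents, delta=0.5):
    """Return {node: contribution} for `disease` and all of its ancestors."""
    contrib = {disease: 1.0}
    frontier = [disease]
    while frontier:
        node = frontier.pop()
        for parent in parents.get(node, ()):
            value = delta * contrib[node]
            # Keep the largest contribution reaching each ancestor.
            if value > contrib.get(parent, 0.0):
                contrib[parent] = value
                frontier.append(parent)
    return contrib

def semantic_similarity(d1, d2, parents, delta=0.5):
    """Shared contributions divided by the sum of both semantic values."""
    c1 = semantic_contributions(d1, parents, delta)
    c2 = semantic_contributions(d2, parents, delta)
    shared = set(c1) & set(c2)
    numerator = sum(c1[t] + c2[t] for t in shared)
    return numerator / (sum(c1.values()) + sum(c2.values()))

# Toy DAG (hypothetical): child -> list of parents, with root "R".
parents = {"A": ["B", "C"], "B": ["R"], "C": ["R"], "D": ["C"]}
sim_ad = semantic_similarity("A", "D", parents)
sim_aa = semantic_similarity("A", "A", parents)
```

In this toy graph, diseases A and D share the ancestors C and R, so their similarity is positive but below 1, while any disease compared with itself scores exactly 1.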

MeSHHeading2vec Method
The characterization of diseases is an important part of predicting miRNA-disease associations and is directly related to the prediction accuracy of the model. More and more researchers are focusing on high-quality feature representations of diseases, and in this section we utilize a novel computational method, namely MeSHHeading2vec [23]. Compared with the traditional GIP similarity features and semantic similarity features of diseases, this new disease representation method has been shown to perform better. Specifically, a relational network is first constructed by transforming the MeSH tree structure of the diseases, connecting the different disease MeSH headings. In addition, the method counts the nodes and edges in the network and provides a brief analysis of the distribution of node labels and of node degrees, where the pattern of tree numbers corresponding to a node determines the label (category) of each node (MeSH heading). Finally, different network representation learning methods, including DeepWalk [24], LINE [25], SDNE [26], HOPE [27], and LAP [28], are applied to this relational network, thus obtaining high-quality network features of the diseases while retaining the original node information and network structure. In our work, the LINE network representation method was chosen for high-quality disease network feature extraction to enhance the predictive power of SMMDA for potential miRNA-disease associations.
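One plausible way to turn a MeSH tree structure into a relational network of headings is sketched below. It assumes that an edge links two headings whenever one tree number is the direct parent of another (the parent tree number is obtained by dropping the last dot-separated segment); this construction rule, the function name `mesh_edges`, and the heading names are assumptions for illustration, and the actual MeSHHeading2vec procedure may differ.

```python
def mesh_edges(tree_numbers):
    """Map {heading: [tree numbers]} to a set of (parent, child) heading edges."""
    # Look up which heading owns each tree number.
    owner = {tn: heading for heading, tns in tree_numbers.items() for tn in tns}
    edges = set()
    for heading, tns in tree_numbers.items():
        for tn in tns:
            if "." not in tn:
                continue  # top-level tree number, no parent in this toy data
            parent_tn = tn.rsplit(".", 1)[0]
            parent = owner.get(parent_tn)
            if parent is not None and parent != heading:
                edges.add((parent, heading))
    return edges

# Toy data (hypothetical headings); one heading may carry several tree numbers.
tree_numbers = {
    "heading_root": ["C04"],
    "heading_site": ["C04.588"],
    "heading_gi": ["C06"],
    "heading_digestive": ["C04.588.274", "C06.301"],
}
edges = mesh_edges(tree_numbers)
```

Because a heading can appear at several positions in the tree, it can acquire several parents, which is why the result is a general network rather than a tree. The resulting edge list is the kind of input that network embedding methods such as LINE operate on.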

Incorporating Multiple Similarity Profiles and a Novel Disease Representation
In this section, multiple miRNA similarity profile features, disease similarity profile features, and the new high-quality disease representation features are incorporated. Specifically, the final miRNA feature matrix $MFM(m_i, m_j)$ is built from GM, the miRNA GIP kernel similarity, and MF, the miRNA functional similarity matrix. Similarly, the final disease feature matrix $DFM(d_i, d_j)$ is built from DM, the new high-quality disease representation feature, DS, the disease semantic similarity feature, and GD, the disease Gaussian interaction profile kernel similarity feature.

Deep Auto-Encoder Learning Method
To eliminate noise and reduce the dimension of the original features, the deep auto-encoder (DAE) method [29] was used in our work to improve the prediction accuracy of miRNA-disease associations. Specifically, we constructed a deep learning framework containing 7 fully connected hidden layers with $(2^9, 2^8, 2^7, 2^6, 2^7, 2^8, 2^9)$ neurons, respectively, each using the ReLU activation function. The first 3 hidden layers form the encoding part, the last 3 hidden layers form the decoding part, and the output of the middle layer is the final reduced-dimension feature data. First, the encoding part projects the original features f from the input layer to the hidden layer h1 using the mapping function y1. Second, the decoding part projects the hidden layer h1 to the output layer h2 using the mapping function y2.
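To make the layer layout concrete, the following NumPy sketch runs a forward pass through an auto-encoder with the stated hidden widths and extracts the 64-dimensional middle-layer output. It is only a shape illustration: the weights are random and untrained, and the input dimension of 1000, the He-style initialization, the linear output layer, and all function names are assumptions not given in the text. In practice the weights would be fitted by minimizing the reconstruction error.

```python
import numpy as np

HIDDEN_SIZES = [2**9, 2**8, 2**7, 2**6, 2**7, 2**8, 2**9]

def relu(x):
    return np.maximum(x, 0.0)

def init_autoencoder(input_dim, hidden_sizes=HIDDEN_SIZES, seed=0):
    """Random (untrained) weights for input -> hidden layers -> reconstruction."""
    rng = np.random.default_rng(seed)
    sizes = [input_dim, *hidden_sizes, input_dim]
    return [
        (rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out)), np.zeros(n_out))
        for n_in, n_out in zip(sizes[:-1], sizes[1:])
    ]

def encode_decode(x, layers, code_layer):
    """Return (middle-layer code, reconstruction) for a batch of inputs."""
    h = np.asarray(x, dtype=float)
    code = None
    for i, (weights, bias) in enumerate(layers):
        h = h @ weights + bias
        if i < len(layers) - 1:  # ReLU on hidden layers, linear output layer
            h = relu(h)
        if i == code_layer:
            code = h
    return code, h

input_dim = 1000  # hypothetical feature length
layers = init_autoencoder(input_dim)
code_layer = HIDDEN_SIZES.index(min(HIDDEN_SIZES))  # index of the 64-unit layer
batch = np.random.default_rng(1).normal(size=(5, input_dim))
code, reconstruction = encode_decode(batch, layers, code_layer)
```

The `code` array is the reduced representation that would be passed on to the classifier, while `reconstruction` is used only during training to measure how well the input is recovered.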

Extreme Gradient Boosting
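In recent years, Extreme Gradient Boosting (XGBoost), proposed by Chen et al., has been widely used by researchers and has yielded satisfactory results. XGBoost is a classifier based on an ensemble of classification and regression trees (CART) that uses gradient boosting to optimize the trees [30].
The output of a single tree is defined as:

$$f(x_i) = W_{q(x_i)}$$

where $x_i$ is the input vector, $q(x_i)$ is the leaf node to which $x_i$ is assigned, and $W_q$ is the score of leaf node $q$. On this basis, the output of an ensemble of $K$ trees is:

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i)$$

The objective function $O$ at step $t$ of the XGBoost method is:

$$O^{(t)} = \sum_{i=1}^{n} L\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t)$$

where $L$ is the training loss between the predicted output $\hat{y}$ and the real $y$, $n$ is the number of training samples, and the second term is for regularization. The complexity of a tree is defined as follows:

$$\Omega(f) = \gamma T + \frac{1}{2} \lambda \lVert W \rVert^2$$

where $\gamma$ is the pseudo-regularization hyperparameter, $T$ is the total number of leaf nodes, and $\lambda$ is the L2 regularization coefficient on the leaf weights. To find the optimal weights $W$, a second-order approximation of the loss function is used, and the optimal value of the objective function is:

$$O^{*} = -\frac{1}{2} \sum_{j=1}^{T} \frac{\left(\sum_{i \in I_j} g_i\right)^2}{\sum_{i \in I_j} h_i + \lambda} + \gamma T$$

where $I_j$ is the set of samples assigned to leaf node $j$, and $g_i$ and $h_i$ are the first- and second-order gradient statistics of the loss function, given by:

$$g_i = \frac{\partial L\left(y_i, \hat{y}_i^{(t-1)}\right)}{\partial \hat{y}_i^{(t-1)}}, \qquad h_i = \frac{\partial^2 L\left(y_i, \hat{y}_i^{(t-1)}\right)}{\partial \left(\hat{y}_i^{(t-1)}\right)^2}$$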
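The role of the gradient statistics can be illustrated with a small worked example in NumPy. The sketch assumes a logistic loss (the text does not specify the loss), a regularization coefficient of 1.0, and a toy tree with two leaves; the function names are hypothetical. It computes the closed-form leaf weight, minus the summed gradients divided by the summed second-order statistics plus the L2 coefficient, and the corresponding optimal objective value.

```python
import numpy as np

def logistic_grad_hess(y_true, raw_score):
    """First and second derivatives of the logistic loss w.r.t. the raw score."""
    p = 1.0 / (1.0 + np.exp(-raw_score))
    return p - y_true, p * (1.0 - p)

def leaf_weight(g, h, lam=1.0):
    """Closed-form optimal weight of one leaf."""
    return -g.sum() / (h.sum() + lam)

def structure_score(g, h, leaf_ids, lam=1.0, gamma=0.0):
    """Optimal objective value for a fixed assignment of samples to leaves."""
    leaves = np.unique(leaf_ids)
    total = 0.0
    for leaf in leaves:
        mask = leaf_ids == leaf
        total += g[mask].sum() ** 2 / (h[mask].sum() + lam)
    return -0.5 * total + gamma * len(leaves)

# Toy data (hypothetical): four samples, current raw predictions all zero.
y = np.array([1.0, 1.0, 0.0, 0.0])
g, h = logistic_grad_hess(y, np.zeros(4))
leaf_ids = np.array([0, 0, 1, 1])          # a split separating the two classes
w_left = leaf_weight(g[leaf_ids == 0], h[leaf_ids == 0])
score_split = structure_score(g, h, leaf_ids)
score_single = structure_score(g, h, np.zeros(4, dtype=int))
```

Here the split lowers the objective relative to keeping all samples in one leaf, which is the criterion a boosting step uses when deciding whether a candidate split is worthwhile.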

The Detailed Prediction Performance of SMMDA
To accurately assess the predictive power of SMMDA for potential miRNA-disease associations, the widely adopted five-fold cross-validation method was utilized. The samples were randomly shuffled and divided evenly into five parts; in each of the five rounds, one part served as the test dataset and the remaining four parts served as the training dataset. The detailed results of the experiments are recorded in Table 1, containing six commonly used predictive metrics, including accuracy (Acc.), precision (Prec.), sensitivity (Sen.), Matthews correlation coefficient (MCC), and area under the ROC curve (AUC). From the experimental results, we can see that SMMDA achieved a mean accuracy of 86.68% with a standard deviation of 0.42%, which demonstrates the strong performance of SMMDA. For the AUC metric, which is more indicative of the model's predictive power, SMMDA obtained a mean of 94.06% with a standard deviation of 0.23% under five-fold cross-validation.
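The evaluation protocol can be sketched as follows. The example shows how a shuffled five-fold split can be generated and how accuracy, precision, sensitivity, and MCC follow from the confusion matrix; AUC is omitted for brevity. The function names, the seed, and the toy labels are assumptions for illustration.

```python
import math
import numpy as np

def five_fold_indices(n_samples, n_folds=5, seed=0):
    """Yield (train_indices, test_indices) for a shuffled k-fold split."""
    order = np.random.default_rng(seed).permutation(n_samples)
    folds = np.array_split(order, n_folds)
    for k in range(n_folds):
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        yield train, folds[k]

def binary_metrics(y_true, y_pred):
    """Accuracy, precision, sensitivity, and MCC from binary labels."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = int(np.sum(y_true & y_pred))
    tn = int(np.sum(~y_true & ~y_pred))
    fp = int(np.sum(~y_true & y_pred))
    fn = int(np.sum(y_true & ~y_pred))
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "mcc": (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0,
    }

# Toy labels (hypothetical): 3 true positives, 3 true negatives, 1 FP, 1 FN.
metrics = binary_metrics([1, 1, 1, 0, 0, 0, 1, 0], [1, 1, 0, 0, 0, 1, 1, 0])
splits = list(five_fold_indices(23))
```

Averaging each metric over the five test folds, and reporting the standard deviation across folds, yields summary figures of the kind given in Table 1.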

Comparison of Different Feature Combinations
To further assess the capability of our proposed feature descriptor, we compared it with a different descriptor. In particular, the feature descriptor in our work is generated by fusing a novel disease representation, miRNA functional similarity, disease semantic similarity, and the GIP kernel similarity of miRNAs and diseases. For comparison, a different feature descriptor (DescSim) is generated by fusing only miRNA functional similarity, disease semantic similarity, and the GIP kernel similarity of miRNAs and diseases. The detailed results of the feature descriptor DescSim under five-fold cross-validation are shown in Table 2. The results show that our feature descriptor performs better than the descriptor used in many previous methods, which fuse only similarity information to predict underlying miRNA-disease associations.

Comparison of Different Classifier Methods
To select the best classifier for the SMMDA model, we conducted five-fold cross-validation experiments using different classifiers, including decision tree (DT) [31], logistic regression (LR) [32], random forest (RF) [33], and Extreme Gradient Boosting (XGBoost). All experiments were run in the same environment, and each classifier used its default training parameters, to keep the comparison fair and easy to carry out. The average results of the different classifiers are displayed in Table 3. The ROC curves with AUC values and the PR curves with AUPR values are shown in Figure 2. The comparison experiment demonstrates that XGBoost performs better than the other methods. Therefore, it is the most suitable classifier for the SMMDA model.

Comparison of Previous Related Works
To further demonstrate the good performance of SMMDA, we compared it with 10 previous state-of-the-art computational models, namely DANE-MDA [16], MLMDA [34], MTDN [17], VAEMDA [18], LMTRDA [35], DBMDA [36], WBSMDA [37], PBMDA [38], HDMP [39], and RLSMDA [40]. The data sets used by all these models are from the HMDD database. We selected the average AUC under five-fold cross-validation as the evaluation indicator. As shown in Table 4, SMMDA has a higher mean AUC value in the experiment, which supports its superior performance in the field of miRNA-disease association prediction.

Case Studies
To further evaluate whether SMMDA performs accurately and robustly, we selected three complex human diseases for case studies: colon neoplasms, breast neoplasms, and esophageal neoplasms. Specifically, the known miRNA-disease associations in HMDD v3.0 [19] were selected as the training samples, and candidate miRNAs for each evaluated disease were ranked according to the predictive scores provided by SMMDA. It is important to note that we deleted the associations already verified in the HMDD v3.0 database to ensure that the validation data set does not overlap with the data set used for training. Finally, we checked the top 50 predicted miRNA-disease associations against the dbDEMC [41] and miR2Disease [42] databases.
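The case-study procedure amounts to ranking candidates by predicted score, excluding miRNAs already associated with the disease in the training data, and counting how many of the top candidates are supported by external evidence. A minimal sketch follows; the function name `top_k_confirmed`, the miRNA identifiers, and the scores are hypothetical placeholders rather than actual predictions.

```python
def top_k_confirmed(scores, known_training, external_evidence, k=50):
    """Rank unseen candidates by score and check the top k against evidence."""
    # Drop miRNAs whose association with the disease was used for training.
    candidates = {m: s for m, s in scores.items() if m not in known_training}
    ranked = sorted(candidates, key=candidates.get, reverse=True)[:k]
    confirmed = [m for m in ranked if m in external_evidence]
    return ranked, confirmed

# Toy example (hypothetical identifiers and scores) for one disease.
scores = {"mirna_1": 0.91, "mirna_2": 0.85, "mirna_3": 0.40,
          "mirna_4": 0.77, "mirna_5": 0.12}
known_training = {"mirna_1"}                 # already in the training data
external_evidence = {"mirna_2", "mirna_3"}   # found in external databases
ranked, confirmed = top_k_confirmed(scores, known_training, external_evidence, k=3)
```

With k set to 50 and evidence pooled from the external databases, the length of `confirmed` corresponds to the verified counts reported for each disease.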
Colon neoplasms are cancers that begin in the final part of the digestive tract (the colon). They can occur at any age, but the incidence is higher in older people. Colon neoplasms usually start as small, non-cancerous (benign) clumps of cells, called polyps, which form inside the colon. Over time, a few polyps become colon cancer. Hence, doctors recommend regular screening to identify and remove polyps before they become cancerous, which can help prevent colon cancer. The SMMDA model was utilized to predict potential miRNA-esophageal neoplasm associations. In the result, 47 of the top 50 predicted miRNAs are identified in the databases (see Table 5).

Breast neoplasms are cancers that occur in the breast cells. They are the most common cancer diagnosed in women in the United States after skin cancer [43][44][45]. Breast neoplasms can occur in both men and women, but are far more common in women. In recent years, the survival rates of breast neoplasms have increased, largely due to factors such as a better understanding of the disease and earlier detection. In this article, SMMDA was utilized to predict potential miRNA-breast neoplasm associations. Finally, 48 of the top 50 predicted miRNAs are identified in the databases (see Table 6).

Esophageal neoplasms are a serious digestive disease with a high death rate [46][47][48]. They are the sixth most common cause of cancer death worldwide, and their incidence varies from place to place. In some areas, the higher incidence of esophageal neoplasms may be due to smoking and alcohol consumption or to particular nutritional habits and obesity [49,50]. In this article, SMMDA was utilized to predict potential miRNA-esophageal neoplasm associations. Finally, 48 of the top 50 predicted miRNAs are identified in the databases (see Table 7).

Conclusions
Recently, machine-learning approaches have been widely investigated in the field of bioinformatics, including the prediction of potential associations between miRNAs and diseases. In this work, considering the limited accessibility, long duration, and high cost of traditional biological research, we presented a novel computational method called SMMDA, which incorporates multiple similarity profiles and a novel disease representation to accelerate the identification of potential miRNA-disease associations. The multiple similarity profiles of miRNAs and diseases and a novel disease representative feature were incorporated, thereby enhancing predictive accuracy. Deep learning is used for high-quality extraction of integrated features, and the gradient boosting method is used for fast and highly accurate training and prediction. Compared with previous related works, the experimental results have demonstrated the superior performance of SMMDA. The comparison experiment of different classifiers and different feature descriptors further proved that the