Identification of MiRNA–Disease Associations Based on Information of Multi-Module and Meta-Path

Cumulative research reveals that microRNAs (miRNAs) are involved in many critical biological processes including cell proliferation, differentiation and apoptosis. It is of great significance to figure out the associations between miRNAs and human diseases that are the basis for finding biomarkers for diagnosis and targets for treatment. To overcome the time-consuming and labor-intensive problems faced by traditional experiments, a computational method was developed to identify potential associations between miRNAs and diseases based on the graph attention network (GAT) with different meta-path mode and support vector (SVM). Firstly, we constructed a multi-module heterogeneous network based on the meta-path and learned the latent features of different modules by GAT. Secondly, we found the average of the latent features with weight to obtain a final node representation. Finally, we characterized miRNA–disease-association pairs with the node representation and trained an SVM to recognize potential associations. Based on the five-fold cross-validation and benchmark datasets, the proposed method achieved an area under the precision–recall curve (AUPR) of 0.9379 and an area under the receiver–operating characteristic curve (AUC) of 0.9472. The results demonstrate that our method has an outstanding practical application performance and can provide a reference for the discovery of new biomarkers and therapeutic targets.


Introduction
MicroRNA (miRNA), with a length between 18 and 24 nucleotides, is one of the types of non-coding RNAs in cells. Previously, miRNA was considered as a useless clip of human gene and even once called 'junk gene' because it could not encode protein [1]. However, more and more research studies show that miRNA is able to regulate the gene expression affecting some essential biological processes, such as proliferation, division, growth and apoptosis of the cell [2][3][4]. It commonly binds with messenger RNA (mRNA) at the three prime untranslated region (3 UTR) to achieve the transcription repression or degradation of the mRNA target [5]. Therefore, a high or low level of miRNA can lead to chaotic protein synthesis, which may destroy normal metabolism and cause dysfunction, further inviting diseases [6]. In addition, some studies have also shown that miRNA serving as an epigenetic regulator of gene expression goes hand in hand with human diseases [7][8][9][10]. Therefore, to identify the association between miRNAs and diseases is helpful for understanding the pathogenesis of disease. Moreover, miRNA can serve as a promising biomarker for diagnosis or a target for treatment [11]. Nevertheless, traditional biological experiments, limited by high cost, and being laborious and time consuming, are prone to failure to find all the relations between miRNA and disease. With the development of database and biotechnology, the massive accumulation of biological data enables researchers to multi-module meta-path along with graph attention network is employed to extract the network topology features of miRNAs and diseases, and support vector machine (SVM) is used as classifier to identify the potential MDA (MMGAN-SVM). Finally, five-fold crossvalidation is conducted to evaluate the prediction performance and the case studies with lymphoma, liver neoplasms and lung neoplasms are performed to demonstrate the practical application performance.
Overall, the main contributions of this work are as follows: 1.
Protein information and meta-path strategy were utilized to construct the multimodule, which can enrich the information of miRNAs and diseases. 2.
The topological and semantic information can be better learned by Graph attention network and attention mechanism. 3.
A reliable negative sample selection strategy was utilized to overcome the imbalance between positive and negative samples.

Dimension Optimization of Node Representation
The node representation implies the complex information in latent feature space, and its dimensionality affects the predictive performance of the model. A low number of dimensions may lead to the loss of information, while a high number of dimensions will lead to the introduction of noise and time consuming for calculation. Thus, discovery of the optimized dimension of the node representation is attempted based on the 5-CV through changing dimension in the range of (32,64,128,256,512). The experiment is repeated 10 times for each dimension. Here, Acc, Roc and Aupr are utilized to evaluate the effect of dimension on model performance and statistical average results are shown in Figure 1. The Acc of each dimension is 0.8591, 0.8628, 0.8719, 0.8751 and 0.8700 and its standard deviation (std) is 0.0030, 0.0038, 0.0033, 0.0023 and 0.0040. The Roc of each dimension is 0.9178, 0.9251, 0.9392, 0.9472 and 0.9448 and its std is 0.0031, 0.0054, 0.0030, 0.0016 and 0.0034. The Aupr of each dimension is 0.8965, 0.9058, 0.9180, 0.9379 and 0.9401 and its std is 0.0052, 0.0054, 0.0063, 0.0042 and 0.0043. We can conclude that higher dimensionality tends to be better performance. However, a high feature vector can lead to a huge computational burden and long model training time. Therefore, the optimal feature dimension for node representation is set to 256. The learning curve of our model is shown in Figure 2 and the result illustrates that the model has been trained in an optimal state.

Classifier Optimization
Here, deep neural networks (DNNs), such as multi-layer perceptron (MLP), convolutional neural networks (CNN) and the traditional machine learning method, including SVM and random forest (RF), are utilized to construct a model. The 5-CV is conducted with a different model 10 times in the same condition as well as with the optimal parameter, and the result is shown in the Figures 3-5 and in Table 1. In the 5-CV experiment, SVM shows the best performance in Auc and Aupr. Although the evaluation measures of SVM in 10 repetitive experiments are a little better than those of other models, SVM performs a lower std than other models. In conclusion, SVM shows the better performance in the majority of evaluation measures.

Comparison with Other Methods
To further demonstrate the performance of the current method, a comparison is performed with some state of art methods including PBMDA [20], WBNPMD [37], NIM-CGCN [27], DNRLMF-MDA [38] and VGAE-MDA [39]. PBMDA is a path-based method which aims at eliminating weak interactions. WBNPMD predicted the MDA by the bipartite network projection with weight. NIMCGCN is a matrix completion-based method which learns the feature by GCN. DNRLMF-MDA is a matrix factorization-based method and it utilized dynamic neighborhood regularization to improve performance. VGAE-MDA adopted variational graph auto-encoders to integrate the score from well-trained two subgraphs. Based on the benchmark dataset, the best results of 5-CV from our model are shown in Figures 6 and 7. The average of Acc, Roc, Aupr and F1 measured ten times in the experiment are 0.8753, 0.9472, 0.9374 and 0.8801 with the std 0.0036, 0.0015, 0.0030 and 0.0034, respectively. The five-fold cross-validation results of the existing methods are shown in Figure 8. The AUCs of PBMDA, WBNPMD, NIMCGCN, DNRLMF-MDA and VGAE-MDA are 0.9172, 0.9173, 0.9291, 0.9357 and 0.9394, respectively. In our method, the features of miRNAs and diseases are not only enriched by extra information of the protein but also integrated with the structure semantic information of MDA. In addition, SVM can show great performances in nonlinear classification tasks. Due to these strategies, the result also illustrated correspondingly that our method presented an outstanding performance.

Proportion of Negative Sample
In fact, the number of negative samples is much larger than the number of positive samples. Therefore, the impact of the ratio of positive and negative samples on the performance of the model is further investigated. A negative sample with different ratios of 1:1, 1:2, 1:3, 1:4 and 1:5 is randomly selected to conduct the 5-CV, and the result is shown in Figure 9 and listed in Table 2. As we can see, some evaluation measures are affected significantly by the unbalance between positive and negative samples, because these evaluation measures are sensitive to the ratio between positive and negative samples. With the increase in negative samples, the value of Aupr, Sens, F1 and Mcc slowly descends. A greater number of negative samples involved in the training procedure makes it easier for the model to identify the negative samples. Thus, the value of Acc increases along with the growth of ratios. The values of Auc and Aupr fluctuate within a controllable range. However, aiming at digging potential MDA, it is necessary for the model to obtain high sensitivity. Thus, to display the best performance of our model, the proportion of negative and positive samples is set as 1:1.

Reliability of Negative Sample
At present, there is no database dedicated to collecting miRNA-disease non-association pairs because these pairs cannot provide more information to promote the mechanisms' research and drug discovery. To overcome this problem, a random matching method is utilized to construct negative samples; however, it may contain false negative samples. Therefore, the influence of negative sample reliability on model performance is further studied. At first, we calculated the mean values of all dimensions for all the positive samples to form a cluster vector. Then, we obtained the average Euclidean distance (AED) by calculating Euclidean distance between each negative sample and the cluster vector. Depending on different threshold of AED, the original negative sample set was able to be refined and shrunk. The AED threshold was set in the range of (0.4AED, 0.5AED, 0.6AED, 0.7AED, 0.8AED, 0.9AED 1.0AED) and the negative samples whose Euclidean distance was lower than the threshold were removed to obtain different negative sample datasets. Then, negative sample with the same ratios as positive samples were randomly selected from the dataset for the training set in each of the threshold experiments. The results of different threshold are shown in Figure 10. The values of Acc, Auc, Aupr, Sens, Spec, Precision,

Case Studies
To illustrate the practical application performance of our model, the case studies are implemented over the three common human diseases: liver neoplasm, lung neoplasm and leukemia. Specifically, the MDA information of each case study is erased during the model training and the prediction score is acquired for all the miRNA candidates. Here, the ratio of positive and negative samples is set as 1:1 and the strategy of reliable negative samples selection is utilized in training procedure. According to the prediction scores, these identified potential disease-related miRNAs are ranked in descending order. For the three diseases, the recognized top 30 miRNAs and the corresponding scores are listed in Tables 3-5, respectively. Meanwhile, these results are validated by the databases of HMDD V3.0 and dbDEMC. The latter is a database recording the expression profiles of cancer-related miRNA and the published literature [40].    Development of liver neoplasm, which has the highest mortality rate in the East Asia region, is contributed to by genetic and epigenetic factors [41]. There are two principal subtype of liver cancer, hepatocellular carcinoma (HCC) and cholangiocarcinoma, and the former is the main type happening to the case in [42]. All the top thirty miRNAs predicted can be confirmed by HMDD V3.0 or dbDEMC. In addition, some researchers reported that the over-expression of miR-221/222 is responsible for the multifocality of HCC, and the over-expression if miR-155 occurs after the cancer recurrence [43,44]. All of those miRNAs appear in the Top thirty predicted results.
Lung neoplasm is a common tumor with the highest morbidity, after breast neoplasm, worldwide, and it can be divided into two categories: small cell lung carcinoma and nonsmall cell lung carcinoma [45]. As listed in Table 4, all miRNAs can be validated by HMDD V3.0 or dbDEMC. Fan et al. reported the expression of miR-20a and miR-15b to be evidence to distinguish the case from healthy individuals [46]. In addition, miR-223 and miR-145 in plasma can be considered as potential biomarkers for early diagnosis [47].
Leukemia is recognized as a progressive malignant disease and is divided into four main types: acute leukemia, chronic leukemia, myelogenous leukemia and lymphocytic leukemia [48]. Twenty-nine of the top thirty predicted miRNAs in Table 5 can be validated by HMDD V3.0 or dbDEMC. Only one predicted result of miR-200b without recorded in database; however, it is revealed to promote the cell proliferation and invasion in leukemia [49]. MiR-200b acts as an oncogenic regulator in human lung cancer. The proliferation, invasion and apoptosis of leukemia cells can be controlled by miR-200b through its regulatory of NOTCH1 signaling pathway. In addition, the inactivation of miR-155 and miR-29 contribute in leukemia and the expression decrement of miR-223 can be used to distinguish the case from a healthy individual [50,51].

Discussion
As the epigenetic controller, miRNAs are involved in gene expression and cellular signaling pathways, which makes a difference in cell propagation, division, growth and apoptosis leading. With these functions, miRNAs are considered to play a critical role in the initiation and progression of human diseases as well as being the promising biomarker or therapeutic target to help with the early diagnosis and treatments. Hence, it is meaningful to discover the potential related miRNAs for a disease. In this study, a model is proposed for feature extraction and to build a model classification. The results have been compared with the state-of-the-art methods in 5-CV and the case studies showed that our method has a great performance. The comparisons of our method and other methods were performed, and advantages and drawbacks are listed in Table 6. The methods of PBMDA and WBNPMD obtained AUC of 0.9172 and 0.9173, respectively, because neither complex network was created, nor weighted edges adopted. Construction of the complex network contribute to enriching potential information of networks and adoption of a weighted edges strategy brings known microRNA disease associations into sharper focus. On the contrary, the methods of NIMCGCN, DNRLMF and VGAE-MDA with AUC of 0.9291, 0.9357 and 0.9394 not only constructed a complex network but also adopted diverse strategy to improve the prediction performance, such as neural inductive matrix completion (NIMCGCN), dynamic regularized weight (DNRLMF) and variational Bayesian inference (VGAE-MDA). Our method obtained the highest AUC of 0.9472, because the complex network was constructed and the weighted edge of microRNA disease association was considered among different modules. In addition, the weighted parameters can be adaptively learned by loss function. However, the unbalance sample problem should be investigated for all methods. The outstanding performance of our method stems from three factors. First, the information of protein is introduced to enrich the feature of miRNAs and diseases, and a composite module based on meta-path is constructed. Second, the latent feature incorporating information of the node and topological structure are extracted by node aggregation in different meta-paths and modules with an attention mechanism. Third, SVM is able to complete the non-linear classification task well on the feature extracted. In the future, much more information, such as miRNAs expression profiles, miRNA sequences and drugs, will be taken into account to improve the MDAs prediction performance. In addition, more efficient feature extraction algorithms will be a novel direction.

Materials and Methods
The experiment-verified miRNA-disease associations were retrieved from the Human microRNA Disease Database (HMDD) [52]. In this work, HMDD V2.0 was adopted as the benchmark dataset with 5430 human miRNA-disease associations incorporating 495 miRNAs and 383 diseases after deduplication and normalization. For convenience, these associations are described as an adjacent matrix A ∈ {0, 1} m × n , in which m and n are the number of miRNAs and diseases, respectively. If an miRNA i is associated with a disease j, the value of A i,j is 1, and 0 vice versa. In addition, the miRNA-protein associations were collected from the miRTarBase database and the disease-protein associations from Comparative Toxicogenomics Database (CTD) [53,54].

MiRNA Integration Similarity
MiRNA integration similarity was composed of miRNA functional similarity (MFS) and Gaussian interaction profile kernel similarity [55]. The calculation of MFS was defined according to the previous work, which was based on the assumption that the function of two miRNAs are more similar if the number of the common disease associated with them is greater [56,57]. The score of MFS was defined as FS(i, j), i.e., the similarity score between miRNA i and miRNA j. In addition, to supplement the missing entries of MFS, Gaussian interaction profile kernel similarity mGS(i, j) was adopted and defined as Equation (1): where M(m i ) and M(m j ) indicate the ith and jth row of the adjacent matrix A, respectively. δ m represents the kernel bandwidth parameter, and is illustrated as Equation (2): where m is the number of miRNAs. Finally, miRNA integration similarity is described as Equation (3):

Disease Integration Similarity
Disease integration similarity constitutes disease semantic similarity and Gaussian interaction profile kernel similarity. The entry of diseases in the National Library of Medicine (http://www.ncbi.nlm.nih.gov/ (accessed on 5 November 2021)) describes the relationship among different disease, which can be used to construct a hierarchical directed acyclic graph (DAG). According to the definition by Wang et al. [56], semantic contribution of a disease d is calculated as Equation (4): where σ is a semantic contribution decay factor, and is maintained it as the same as the previous work that was set as 0.5 [56]. The semantic value of the disease d i , DV(d i ) is defined as Equation (5): where D(d i ) is the node set of disease d i and its ancestor. The semantic similarity score between disease d i and d j can be calculated as Equation (6): Finally, the disease semantic similarity combined with Gaussian kernel similarity is defined as Equation (7): if there is a semantic similarity between d i and d j dGS d i , d j otherwise (7) where dGS is the Gaussian kernel similarity of disease and defined as Equation (8), its formulation is similar with Equation (8).
where Ds(d i ) and Ds(d j ) indicates the ith and jth column of the adjacent matrix A, respectively.

Multi Module Construction
Meta-path is explained as a path in form of P = N 1 → N m (simplified as N 1 N 2 . . . N m−1 N m ), which illustrates that the starting node N_1 is able to reach one of the destination nodes connected by a composite relation R = r_1 r_2..r_(m− 2) r_(m -1) [58]. Based on the meta-path, various significance can be received from the relation between two of the identical type nodes. Thus, miRNA-protein association and diseaseprotein association were introduced to enrich the information of the miRNA-miRNA and disease-disease association (MMA and DDA) network. Another four association matrices MDMA (MMA based on disease), MPMA (MMA based on protein), DMDA (DDA based on miRNA) and DPDA (DDA based on protein) are shown in Figure 11. Take MDMA for example, the value of MDMA(i, j) is 1 when miRNA i can reach miRNA j through a disease d, and it is 0 vice versa. To increase the density of MDA to about 3%, the heterogeneous graph is constructed in term of multi-module as Equations (9)-(11): where the A T is the transposition matrix of A.

Node Feature Linear Transformation and Aggregation
The original feature of miRNA and disease had to be projected into the same latent feature space, because their original feature represented two different feature spaces. The latent feature of nodes could be obtained by leveraging the transformation matrix to carry on linear transformation. Specifically, each type of node adopts a respective transformation matrix. For the node i ∈ N c of type c, the latent feature h i ∈ R d of it could be obtained by using Equation (12): where W c ∈ R d × n is the transformation matrix of type c and x i c ∈ R n is the original feature of node i.
GAT was able to aggregate the information of neighboring nodes for the central node by of assigning learnable weight, which finally obtained the node representation by fusing the information of the network topological structure and node feature. Specifically, GAT adopted the SoftMax function to calculate the attention score of each node, and then continually updated the information of the central node by aggregating that of neighboring nodes based on their respective attention score. For each graph G constructed by metapath, the importance e G ij contributed by neighbor node j to central node i was defined by Equation (13): where ω G ∈ R 2 d is the attention parameter vector for graph G and || represents the concatenation operation. SoftMax function was utilized to normalize the importance of all nodes in order to obtain the final attention score α ij which was defined by Equation (14): (14) where k indicates the neighbor node of i in the graph G.

Module Aggregation
Based on the attention score, the information of node i was able to aggregate that of its neighbor nodes and eventually obtain the node representation z G i of graph G and defined as Equation (15): where σ(·) represents the nonlinear activation function, and sigmoid function was used in the current study.
For the sake of the stability and low variance, multi-head attention mechanism was introduced to improve the learning process of attention score. Specifically, the aggregation was repeated for K times and the formulation (15) can be revised by Equation (16): In addition, due to different meta-paths, every node obtained more than one representation in various semantic significances. To figure out which meta-path was more essential, an attention mechanism could be also adopted among the representations obtained by different meta-paths. We denoted that different meta-path mode as G 1 , G 2 , . . . , G n and the corresponding node representation as z G 1 i , z G 2 i , . . . , z G n i . Then, an attention mechanism was used to fuse and average all of the representations with their respective weight, defined by Equations (17)- (19).
where W c ∈ R D×d and ε ∈ R D are the weight matrix and bias vector. v c is a set of neighbor nodes in the same meta-path mode and λ T R D is the attention vector of all meta-paths for node type c. β G k indicates the final attention score after normalizing the importance contribution of a meta-path and z i is the final node representation.

Training and Prediction
Inspired by some matrix factorization or completion method, the latent feature of miRNA and disease was obtained by pre-training [59][60][61][62]. Specifically, the final node representation was used to reconstruct MDA by an inner product operation and the reconstruction error was reduced through minimizing the cross-entropy loss function defined by Equation (20): where a and a are the original MDA and reconstruction MDA, respectively. According to minimizing the loss function, the parameter mentioned above is constantly trained. When the well-trained latent feature of miRNAs and disease was acquired, the form of the miRNA-disease pair was concatenated as the input for SVM, which is a binary classifier with a significant accuracy and robustness in sparce and noise data. With the kernel function, it was also able to implement the non-linear classification and cater to the data complexity of miRNA and disease. After the prediction of SVM, the miRNAdisease pair association score can be obtained to indicate the association probability of the miRNA-disease pair.

Model Experiment and Evaluation
In this work, five-fold cross validation (5-CV) was utilized to evaluate the prediction performance of the model. The known MDA was considered as the positive sample and the stochastically selected identical amount of unobserved MDA as the negative sample. Positive and negative samples were combined into a dataset, which was randomly divided into five equal-sized subsets. Each subset was used as the test set in turn, and the remaining subsets were utilized as a training set. To reduce the variance cause by randomness, the procedure was repeated 10 times. To evaluate the performance of the model, several evaluation measures were taken into account, including accuracy (Acc), precision (Pre), specificity (Spe), sensitivity (Sens, also called as recall), F1-mesure(F1), Matthews correlation coefficient (Mcc), the areas under receiver operating characteristic (ROC) curve (AUC) and areas under precision-recall (PR) curve (AUPR). The Acc, Pre, recall and F1 were calculated by Equations (21) where the TP, TN, FP and FN represent the number of true positives, true negatives, false positives and false negatives, respectively. For the part of feature extraction, we pretrained it for 4000 epochs and employed an Adam optimizer with a learning rate 0.001. In addition, the other hyper parameters only affected the dimension of node representation, and the dimension was set in the range of (32,64,128,256,512). For the part of prediction with SVM, the penalty factor C was set as 150 and the radial basis function was used as a kernel function. In addition, the architecture and parameters of the model are listed in Table 7. SVM Kernel function (radial basis function) C factor (50) The framework of MMGAN-SVM is illustrated in Figure 12. First of all, as shown in Figure 12a, known miRNA-disease association was coordinated with Gaussian interaction profile kernel to calculate the miRNA (disease) integrated similarity (MS/DS). The miRNA-miRNA (disease-disease) relation network was built by adopting meta-path with the help of MDA, miRNA-protein association and disease-protein association. Second, as shown in Figure 12b, with the preparation above, combination matrix constructed the multi-module to be the input of the model. Then, as shown in Figure 12c,d, the latent feature of miRNA (disease) was acquired by model concatenate in the form of an miRNA-disease pair, which later served as the input of SVM for MDA prediction.

Conclusions
In this study, extra protein information and meta-path were introduced to construct a multi-module, and GAT was utilized to learn the latent feature of the node in every module. Then, with the attention mechanism, the topological and semantic information of nodes could be aggregated adaptively by their different neighboring nodes and modules. With abundant information of latent features, the latter classification task conducted by SVM obtained a great performance in MDA prediction. In addition, the impact of different ratios of positive and negative samples on the model was explored. To some extent, the unbalance between negative and positive samples actually made some influences on the model. Thus, the strategy of reliable negative sample selection was adopted to reduce the impact of sample unbalance. In addition, the results showed that the performance of prediction can be improved by selecting negative samples within a certain threshold. In conclusion, we propose a new avenue for research to discovery potential biomarker and treatment for diseases.