Using Sequence Similarity Based on CKSNP Features and a Graph Neural Network Model to Identify miRNA–Disease Associations

Among many machine learning models for analyzing the relationship between miRNAs and diseases, the prediction results are optimized by establishing different machine learning models, and less attention is paid to the feature information contained in the miRNA sequence itself. This study focused on the impact of the different feature information of miRNA sequences on the relationship between miRNA and disease. It was found that when the graph neural network used was the same and the miRNA features based on the K-spacer nucleic acid pair composition (CKSNAP) feature were adopted, a better graph neural network prediction model of miRNA–disease relationship could be built (AUC = 93.71%), which was 0.15% greater than the best model in the literature based on the same benchmark dataset. The optimized model was also used to predict miRNAs related to lung tumors, esophageal tumors, and kidney tumors, and 47, 47, and 37 of the top 50 miRNAs related to three diseases predicted separately by the model were consistent with descriptions in the wet experiment validation database (dbDEMC).


Introduction
According to molecular biology, the entire human genome can be divided into regulatory, coding, and noncoding genes. The coding genes are the main carriers of human genetic information, accounting for only 1.5% of the whole human genome [1][2][3]. Previous studies have regarded non-coding genes as transcriptional noise in the coding process. However, microarray experiments have found that non-coding genes can participate in cell activities by interacting with proteins and DNA, affecting gene activation and silencing; RNA splicing, modification, and editing; and protein translation, thus affecting various physiological processes. MicroRNA (miRNA) is a type of non-coding, single-stranded RNA encoded by endogenous genes. The most common role of miRNA is to directly regulate target genes by affecting post-transcriptional gene regulation of promoters. Abnormal expression of miRNA can cause many human diseases, and miRNA can be used as a drug target for disease treatment [4,5]. Therefore, the regulatory role of miRNA in disease expression is significant for a variety of complex human physiological processes and disease pathophysiology. However, traditional experimental methods are often limited. To this end, it is of great significance to speed up the verification process, reduce bias in biological experiments, and establish a method to predict the possible associations quickly and effectively between miRNAs and disease [6].
Common biological experimental methods from the 1990s to the beginning of the 21st century include reverse transcription-polymerase chain reaction (RT-PCR) [7], Northern blotting [8], and microarray profiling [9]. Although these traditional methods can accurately ting from the advantage of graph neural networks in finding miRNA-disease associations to construct a feature extraction method based on miRNA sequence similarity information.
In this study, we propose a method to calculate the sequence characteristics and sequence similarity of miRNA using five distinct miRNA sequence characteristics. Then, by integrating disease semantic similarity, miRNA Gaussian interaction profile kernel similarity, and disease Gaussian interaction profile kernel similarity, we construct a new bipartite graph of miRNA and disease. Then, the auto-encoder of a graph neural network is used to predict miRNA-disease association. Moreover, we evaluated prediction performance using 5-fold cross-validation. The model, using sequence similarity based on the composition of k-spaced nucleic acid pair (CKSNAP) features and a graph neural network, achieved an average area under the curve (AUC) of 93.71 ± 0.42%, with an accuracy of 83.69 ± 1.42%, precision of 77.73 ± 2.29%, recall of 94.62 ± 0.97%, and F1-score of 85.31 ± 1.00%. To further verify the performance of this model, case studies on lung, esophageal, and kidney neoplasms were conducted. Respectively, the results showed that 47, 47, and 37 of the top 50 predicted miRNAs for these neoplasms can be confirmed by the database of Differentially Expressed MiRNAs in human Cancers (dbDEMC) [33]. In conclusion, it can be inferred that our model is effective and accurate in identifying potential associations between miRNA and disease.

Human miRNA-Disease Associations
In this study, we adopted the Human microRNA Disease Database (HMDD; v3.2) as the benchmark dataset and directly downloaded the experimentally verified miRNA-disease associations from https://www.cuilab.cn/hmdd (accessed on 1 September 2021) [34]. There are 16,427 high-quality miRNA-disease associations recorded in the HMDD database, including 877 diseases and 901 miRNAs. We adopted the adjacency matrix A to quantify associations between these miRNAs and diseases. If a disease is associated with an miRNA, the value of the element at the corresponding position of matrix A is set to 1, and otherwise to 0.

miRNA Sequence Similarity
In this study, the attribute features of miRNAs were represented by sequence similarity information. We downloaded miRNA sequence information from https://mirbase.org/ ftp.shtml and utilized the K-mer (k = 3), Moran, Geary, NMBroto, and CKSNAP (k = 5) methods to obtain the sequence features of miRNAs.
K-mer: We set up a sliding window with a size of 3 and a sliding distance of 1 to obtain occurrence frequencies for all 3-monomere units. Then, each miRNA sequence was converted into a 64-dimensional vector based on the 64 3-monomer combinations. On this basis, we used cosine similarity, euclidean distance, and the Pearson correlation coefficient to calculate miRNA sequence similarity, defined as follows: where i represents a nucleotide combination with a length of three, such as AAA, AAC, and AAG; N(i) is the number of nucleotide combinations and N is the length of a nucleotide sequence. Moran: Moran describes miRNA by physicochemical parameters of nucleotides. The physical and chemical properties of miRNA include Rise (RNA), Roll(RNA), Shift(RNA), Slide(RNA), Tilt(RNA), Twist(RNA), Entropy(RNA), Adenine content, Purine(AG) content, Hydrophilicity(RNA), Enthalpy(RNA)1, GC content, Entropy(RNA)1, Hydrophilicity(RNA)1, Free energy(RNA), Keto(GT) content, Free energy(RNA)1, Enthalpy(RNA), Stacking energy(RNA), Guanine content, Cytosine content, and Thymine content. On this basis, we calculated the miRNA sequence similarity, defined as follows: where the Moran feature of miRNA are defined as follows: where d represents the lag value of autocorrelation, and nlag represents the maximum value of lag. In this study, the lag value d = 3 was selected; M i is the nucleotide properties at position i. M is calculated as follows: Geary: Geary uses the attribute information of nucleotides to describe the sequence characteristics of miRNA. We calculate the miRNA sequence similarity, defined as follows: where the Geary feature of miRNA are defined as follows: The meaning of physical and chemical indicators involving nucleotides, such as d, M, M i , and nlag, is the same as that in the Moran autocorrelation descriptor.
NMBroto: NMBroto is a normalized Moreau-Broto autocorrelation descriptor. The miRNA sequence is calculated similarity, defined as follows: where the NMBroto feature of miRNA is defined as follows: The meaning of physical and chemical indicators involving nucleotides, such as d, M, M i , and nlag, is the same as that in the Moran autocorrelation descriptor.
CKSNAP: CKSNAP uses k-spaced nucleic acid pairs to describe the frequency of separating the current nucleic acid pair from any k nucleic acids.
where the CKSNAP features of miRNA are defined as follows: Using five miRNA sequence features and three similarity calculation methods, 15 miRNA sequence similarity matrices could be obtained. In the following research, all sequence similarity matrices are called MSSM.

Disease Semantic Similarity
Disease semantic similarity can be calculated based on the medical subject heading (MeSH) descriptors, which are available at https://www.ncbi.nlm.nih.gov (accessed on 1 January 2021). The relationship between diseases can be represented as a directed acyclic graph (DAG) network according to disciplines or affiliations, where the nodes represent the MeSH descriptors of diseases, and the directed edges point from parent nodes to child nodes [35]. We adopted DAG d i (d k ) to describe semantic contribution of disease d k to disease d i . On this basis, the semantic contribution is defined as follows: where ∆ is the semantic contribution attenuation facto; this was set to 0.5 according to a previous study [30]. The semantic contribution value will decrease with distance. According to the semantic contribution value of disease nodes, the semantic value of disease d i was calculated as follows: The semantic similarity between disease d i and disease d k could be calculated by the nodes shared by the two disease DAG networks. It was defined as follows: where the element DSSM(d i , d j ) represents the disease semantic similarity between d i and d j .

Gaussian Interaction Profile Kernel Similarity for miRNAs and Diseases
In this study, the binary vector IP(m i ) to denote the interaction profiles of miRNA m i was defined by calculating the correlation between m i and m j . The miRNA Gaussian interaction profile kernel similarity matrix (MGSM) could be calculated as follows: (25) where σ m was applied for controlling the bandwidth of the kernel, and IP(m i ) − IP m j 2 was applied for calculating the euclidean distance of two eigenvectors. The Gaussian kernel bandwidth control parameter was calculated as follows: where nm represents the number of all miRNAs; σ m was set to 1 according to a previous study [30]. Similarly, the Gaussian interaction profile kernel similarity for diseases DGSM(d i ,d j ) between disease d i and d j could be calculated as follows: where nd represents the number of all diseases, σ d was set to 1 according to a previous study [30].

Integrated Similarity for miRNAs and Diseases
There is a large number of sparse values in the miRNA sequence similarity and disease semantic similarity matrices. The final integrated similarity for miRNAs and diseases was obtained from miRNA sequence similarity, disease semantic similarity, Gaussian interaction profile kernel similarity for miRNAs, and Gaussian interaction profile kernel similarity for diseases. It was defined as follows:

Graph Auto-Encoder
In the present study, the prediction of the potential miRNA-disease associations was mainly used to generate the low-dimensional embedding of node information through the encoder based on a graph neural network, so as to realize the identification of the correlation between miRNA and diseases by a bilinear decoder. Our model can be described in four steps, as shown in Figure 1.
First, we constructed a bipartite graph including 877 disease nodes and 901 miRNA nodes. For integrated similarity for miRNA, the similarity between miRNA m i and miRNA m 1 , m 2 , . . . , m 901 could be expressed as follows: where u 1, u 2 , u 3 , . . . , u 901 is the integrated similarity between miRNA m(i) and miRNA m 1 , m 2 , . . . , m 901 . Similarly, the integrated similarity between disease d i and disease d 1 , d 2 , . . . , d 901 could be expressed as follows: The vectors in F m (miRNA) and F d (disease) were embedded into miRNA nodes and disease nodes in the miRNA-disease bipartite graph. In this study, 16,427 known association pairs verified by experiments were regarded as positive samples of input. In order to balance the positive and negative samples, 16,427 negative samples were randomly selected from the unknown associations in the subsequent experiments. In order to project miRNA and disease feature vectors to the same vector space, we designed a linear transformation matrix. The projection process of miRNA nodes can be described as follows: where F m is the original characteristic matrix of miRNA, W ∅ m is to realize the linear transformation matrix of miRNA projection process, and H m is the feature matrix of miRNA projected into the new feature space. Similarly, the projection process of disease nodes can be described as follows: where F d , W ∅ m , and H d are as described above.  Second, we used the aggregator function of the auto-encoder to integrate the feature representation of the node and its neighbor nodes, update the node with a multi-layer perceptron, and embed the aggregated features into the original features of the node. The aggregator function sum(·) was used to realize the feature aggregation of its neighbor nodes, as shown in Equations (35) and (36), where sum(H d (1), H d (2), . . .) represents the aggregation process of the disease nodes d j ., which are the direct neighbors of the miRNA nodes m i ; Dm i is the degree value of node m i . In order to prevent numerical instability, H s m (i) was used for normalization. It is the aggregation feature of the current miRNA node m i .
After the aggregation features of all miRNA nodes were obtained through the aggregator function, the features of all nodes were embedded and connected through the multi-layer perceptron superimposed by the L-layer, and the final embedding of nodes was generated. We used a Leaky Rectified Linear Unit (LeakyReLU) function as the activation function of the multilayer perceptron to avoid the phenomenon of neuron "death" when the input was negative. It is defined as follows: where H m (i) is the original feature of node m i , H s m (i) is the aggregation feature of node m i , H m (i) is the updated node feature, and f (·) is the single-layer multi-layer perceptron function. Similarly, we calculated the aggregation feature of the current disease node d j , the degree value of node d j , and the updated node feature of disease as in Equations (38)-(40).
A multilayer overlay network could enhance the feature embedding of neighbor nodes and preserve the topology of graph data, so as to enhance the expression ability of features.
Third, considering that the number of unknown associations between miRNA and disease is much larger than the known associations, we adopted a sigmoid function as the activation function of the decoder to predict the association score between miRNA and disease. It is defined as follows: where Q is the E-dimensional trainable parameter matrix, H L d (j) is the feature embedding of L-layer disease node d j , H L m (i) is the feature embedding of L-layer miRNA node m i , and A(i, j) is the reconstructed association score matrix.
Finally, we used the deviation between the prediction score matrix and the original characteristic matrix and used the cross-entropy loss function and back propagation algorithm to optimize the model to obtain the best model parameters. It is defined as follows: where A(i, j) is the original characteristic matrix, andÂ(i, j) is the reconstructed prediction correlation score matrix. Due to the small proportion of known associations in the sample data, cost-sensitive learning was used to improve the weight of positive sample loss, so as to improve the prediction accuracy. Furthermore, y and yare the set of positive samples and the set of negative samples in the incidence matrix, respectively.

Model Evaluation
Five-fold cross validation was selected to evaluate the performance of the model. We divided the miRNA-disease associations into five sets, set one as the test set and the other four as the training set, and received five prediction results. Accordingly, the higher the score between miRNA and disease, the higher the possibility of a potential association between them. In the present study, we adopted four common evaluating indicators to evaluate the performance of the models: Accuracy (Acc), Precision (Prec), Recall, and F1-score. Meanwhile, we plotted the receiver operating characteristic curves to intuitively display the performance of our model and utilized the AUC to comprehensively evaluate model performance.

Performance Evaluation of Graph Neural Network Prediction Model Based on Single Features
Based on five sequence features, we used three different similarity calculation methods to obtain 15 different miRNA sequence similarity matrices and constructed 15 prediction models. The performance comparison of the prediction models based on different characteristics and similarity calculation methods is shown in Figure 2. The overall prediction performance of the model for calculating sequence feature similarity based on the Pearson correlation coefficient was better than the other two similarity calculation methods. This type of model obtained the optimal values in AUC, ACC, precision, and F1-score, and had obvious advantages over the other two similarity algorithms. Comparing the performance of the model based on five features, we found that the performance of the model based on K-mer and CKSNAP was better than the other models. Considering that AUC could more comprehensively evaluate model performance, the model based on CKSNAP sequence features and the Pearson similarity calculation method obtained the highest AUC value among the 15 models. In addition, this model was also better than other models in ACC, precision, and F1 scores. characteristics and similarity calculation methods is shown in Figure 2. The overall prediction performance of the model for calculating sequence feature similarity based on the Pearson correlation coefficient was better than the other two similarity calculation methods. This type of model obtained the optimal values in AUC, ACC, precision, and F1score, and had obvious advantages over the other two similarity algorithms. Comparing the performance of the model based on five features, we found that the performance of the model based on K-mer and CKSNAP was better than the other models. Considering that AUC could more comprehensively evaluate model performance, the model based on CKSNAP sequence features and the Pearson similarity calculation method obtained the highest AUC value among the 15 models. In addition, this model was also better than other models in ACC, precision, and F1 scores.

Performance Evaluation of Graph Neural Network Prediction Model Based on Combined Features
Based on the comparison and analysis of model performance based on single features, we proposed prediction models based on combined features, hoping to enhance the expression ability of nodes by embedding multiple features on a single node. Here, we combined the five features in pairs to obtain 10 combined features and used different similarity calculation methods to build 30 different prediction models based on these combined features. In order to compare the performance of the two types of models more

Performance Evaluation of Graph Neural Network Prediction Model Based on Combined Features
Based on the comparison and analysis of model performance based on single features, we proposed prediction models based on combined features, hoping to enhance the expression ability of nodes by embedding multiple features on a single node. Here, we combined the five features in pairs to obtain 10 combined features and used different similarity calculation methods to build 30 different prediction models based on these combined features. In order to compare the performance of the two types of models more intuitively, we classified these two types of models and calculated their average scores on each evaluation indicator and drew the prediction performance based on single-feature and double-feature models, as shown in Figure 3. We noticed that the evaluation indicators of models based on combined features were lower than those of the models based on single features. We speculate that this is because the dimensions of the combined features were too high. Embedding more feature information into a single node also introduced too much noise and redundant information, which led to model instability in the subsequent experimental process, resulting in a decline in prediction performance. In addition, we also found that the score of the models based on combined features depended on the score of the models based on single features. Here, we selected six groups of models for display, as shown in Figure 4. It can be seen that the models based on single features were better than the models based on combined features in AUC, Acc, Prec, and F1-score. In Figure 4, the gray columns are mostly located below the other two columns. The recall rate of models based on combined features in individual groups is higher. The reason for this is that the combined features enrich the information range, which leads to the improvement of recall rate. Combined features could enrich the information contained in a single node, which would make the coverage of information and the construction of relationship structure in the whole network more comprehensive, resulting in the improvement of Recall. However, due to the high dimensions of the feature vector and too much noise information on a single node, other evaluation indicators would decline.
Based on this, the subsequent experiments focused on performance optimization based on single-feature models. In a preliminary study, we set project dimension to E = 256 and set encoder layers to L = 2. In addition, we also found that the score of the models based on combined features depended on the score of the models based on single features. Here, we selected six groups of models for display, as shown in Figure 4. It can be seen that the models based on single features were better than the models based on combined features in AUC, Acc, Prec, and F1-score. In Figure 4, the gray columns are mostly located below the other two columns. The recall rate of models based on combined features in individual groups is higher. The reason for this is that the combined features enrich the information range, which leads to the improvement of recall rate. Combined features could enrich the information contained in a single node, which would make the coverage of information and the construction of relationship structure in the whole network more comprehensive, resulting in the improvement of Recall. However, due to the high dimensions of the feature vector and too much noise information on a single node, other evaluation indicators would decline.
Based on this, the subsequent experiments focused on performance optimization based on single-feature models. In a preliminary study, we set project dimension to E = 256 and set encoder layers to L = 2.

Effects of Projection Dimension and Encoder Layers on Model Performance
Here, we used the model based on CKSNAP features and the Pearson similarity calculation method to study the influence of projection dimension and encoder layers on prediction performance, changing the number of projection dimensions and encoder layers to obtain the scores of different models on five evaluation indicators. Different projection dimensions will affect the expression ability of nodes. Changing the number of encoder layers can adjust the learning degree of neural networks to node feature information. As can be seen from Figure 5a, with an increase in projection dimensions, the recall rate of the model fluctuated greatly, and the change in F1 score was not obvious. The accuracy score increased significantly with the increase of projection dimensions but dropped sharply when the projection dimension, E, was greater than 256. According to Figure 5b, with an increase in encoder layers, the evaluation indexes of the model showed a downward trend as a whole; especially after L > 6, the indexes of the model declined sharply.

Performance Evaluation and Comparative Analysis of Related Models
Here, we marked the top five models in each evaluation indicator and screened out the models with two or more markers to obtain the results shown in Figure 6. Considering that Acc was greatly affected by the proportion of positive samples in the sample set, Acc was not taken as the primary evaluation indicator. Precision and recall provide a single view of the performance, whereas F1 score provides a more comprehensive view of the performance and therefore was used for model evaluation. Therefore, considering AUC and F1 score, we noted that the model with project dimensions E = 64 and encoder layer L = 6 had better performance in all aspects. The receiver operating characteristic curve for the 5-fold cross validation experiment is shown in Figure 7. At the same time, the ROC curve of each fold can be seen in Figure S1.
In order to further evaluate the performance of this model, we compared it with six related models (PBMDA [36], LLCMDA [37], EDTMDA [38], GBDTLR [20], MCLPMDA [39], GAEMDA [40]) for comprehensive evaluation. Considering that different studies used different evaluation indicators, only the AUC value that could comprehensively evaluate the performance of the model was selected for comparative analysis. Our study selected the optimal AUC recorded in each paper for comparison, as shown in Table 1. Among the seven models, our model obtained the highest AUC value, which was 0.15% higher than the AUC value of the model with the second highest, GAEMDA.

Performance Evaluation and Comparative Analysis of Related Models
Here, we marked the top five models in each evaluation indicator and screened out the models with two or more markers to obtain the results shown in Figure 6. Considering that Acc was greatly affected by the proportion of positive samples in the sample set, Acc was not taken as the primary evaluation indicator. Precision and recall provide a single view of the performance, whereas F1 score provides a more comprehensive view of the performance and therefore was used for model evaluation. Therefore, considering AUC and F1 score, we noted that the model with project dimensions E = 64 and encoder layer L = 6 had better performance in all aspects. The receiver operating characteristic curve for the 5-fold cross validation experiment is shown in Figure 7. At the same time, the ROC curve of each fold can be seen in Figure S1.

Case Studies
In recent years, more and more research has shown that the mutation or abnormal expression of miRNA causes many human diseases [41,42]. In order to further evaluate the performance of our prediction algorithm, three neoplasm diseases were selected for independent case studies-lung neoplasm, esophageal neoplasm, and kidney neoplasmas it done in the reported method using the same dataset. We deleted the specific diseases of the case study from the training samples to remove bias from the experiments. We used the remaining miRNAs and diseases to construct test samples and ranked them according to the prediction scores. We compared the top 50 prediction results with the dbDEMC databases to obtain the prediction results.  Lung neoplasm is a malignant neoplasm disease occurring in lung parenchyma and stroma [43]. Lung cancer is one of the diseases with the fastest growth rate of morbidity and mortality. It has become the most common cause of death in malignant tumors; the prediction results of our model for miRNA related to lung neoplasm are shown in Table 2. It can be seen that 47 of the top 50 miRNAs could be verified in the dbDEMC database. Esophageal neoplasm is a malignant neoplasm disease that occurs in esophageal epithelial tissue [44][45][46]. Approximately 300,000 people die of esophageal cancer every year in the world. China is one of the high incidence areas of esophageal cancer in the world; the prediction results of our model for miRNA related to esophageal neoplasm are shown in Table 3. It can be seen that 47 of the top 50 miRNAs could be verified in the dbDEMC database. Kidney neoplasm is a common neoplasm disease in the urinary system. The pathological structure of kidney neoplasm is complex, and the cause of disease is variable [47]. Finding a new entry point for diagnosis is of great significance for the timely targeting of patients. The prediction results of our model are shown in Table 4. It can be seen that 37 of the top 50 miRNAs could be verified in the dbDEMC database. In order to further evaluate the performance of this model, we compared it with six related models (PBMDA [36], LLCMDA [37], EDTMDA [38], GBDTLR [20], MCLPMDA [39], GAEMDA [40]) for comprehensive evaluation. Considering that different studies used different evaluation indicators, only the AUC value that could comprehensively evaluate the performance of the model was selected for comparative analysis. Our study selected the optimal AUC recorded in each paper for comparison, as shown in Table 1. Among the seven models, our model obtained the highest AUC value, which was 0.15% higher than the AUC value of the model with the second highest, GAEMDA.

Case Studies
In recent years, more and more research has shown that the mutation or abnormal expression of miRNA causes many human diseases [41,42]. In order to further evaluate the performance of our prediction algorithm, three neoplasm diseases were selected for independent case studies-lung neoplasm, esophageal neoplasm, and kidney neoplasm-as it done in the reported method using the same dataset. We deleted the specific diseases of the case study from the training samples to remove bias from the experiments. We used the remaining miRNAs and diseases to construct test samples and ranked them

Conclusions
In the present study, we used a variety of miRNA sequence features to better retain the sequence similarity information between miRNAs and used a disease-miRNA bipartite graph to mine for potential deeper association information between miRNAs and diseases. Under a 5-fold cross validation, the experimental results showed that the prediction performance of the prediction algorithm based on combined features was not as good as that based on single features, and there was a dependency between the prediction score based on combined features and the prediction model based on corresponding single features. We used this model for case study verification, and the prediction results of three specific tumors also achieved a good hit rate. Therefore, our research is helpful for researchers to quickly and effectively study the relationship between miRNAs and diseases and plays a guiding role in research. It can save time and the cost of wet experiments to find disease-relevant miRNAs. This method can be used to predict the miRNA associated with a disease, and then perform wet experiment verification. Alternatively, there are results from wet experiments, and the method could be used to provide a confidence to refer to in experimental results. However, the analysis and discussion described in this paper is only a small part of the research on the correlation between miRNAs and disease, and there are more questions to be explored. For example, it could be useful to embed more biological information in the structural association between disease and miRNA. For example, when calculating the sequence similarity of miRNAs, we can consider introducing the functional similarity of miRNAs, the MISIM network, and the correlation information between miRNAs and proteins. Further work can focus on finding methods that can obtain deep-seated network structure information without affecting the prediction performance of the model. Finally, considering that the regulatory mechanisms of miRNA in many complex diseases also play an important role in miRNA-disease associations, the analysis of model performance combined with physiological influencing factors is expected to improve future experiments.