Dual Convolutional Neural Network Based Method for Predicting Disease-Related miRNAs

Identification of disease-related microRNAs (disease miRNAs) is helpful for understanding and exploring the etiology and pathogenesis of diseases. Most recent methods predict disease miRNAs by integrating the similarities and associations of miRNAs and diseases. However, these methods fail to learn the deep features of the miRNA similarities, the disease similarities, and the miRNA–disease associations. We propose a dual convolutional neural network-based method for predicting candidate disease miRNAs and refer to it as CNNDMP. CNNDMP not only exploits the similarities and associations of miRNAs and diseases, but also captures the topology structures of the miRNA and disease networks. An embedding layer is constructed by combining the biological premises about the miRNA–disease associations. A new framework based on the dual convolutional neural network is presented for extracting the deep feature representation of associations. The left part of the framework focuses on integrating the original similarities and associations of miRNAs and diseases. Novel miRNA and disease similarities that contain the topology structures are obtained by random walks on the miRNA and disease networks, and their deep features are learned by the right part of the framework. CNNDMP achieves better prediction performance than several state-of-the-art methods in cross-validation. Case studies on breast cancer, colorectal cancer and lung cancer further demonstrate CNNDMP's powerful ability to discover potential disease miRNAs.


Introduction
miRNAs are non-coding single-stranded RNA molecules of about 22 nucleotides that are encoded by endogenous genes. miRNAs exert their biological functions primarily by regulating the expression of target genes (mRNAs). They usually target a specific sequence in the 3′ untranslated region of mRNAs, inhibiting the translation of the target genes [1][2][3][4][5]. With the development of molecular biology and biotechnology, scientists have found that the abnormal expression of miRNAs is closely related to various human diseases [6][7][8]. Therefore, predicting potential disease-associated miRNAs is of great significance for understanding disease etiology and pathogenesis.
In recent years, several computational methods have been proposed for predicting disease-associated miRNAs, which can be classified into two main categories in general. miRNAs implement their biological functions by regulating the expression of their target mRNAs [9]. Therefore, the first category of methods is based on target genes to predict the potential associations

Performance Evaluation Metrics
Most of the diseases in the HMDD database are associated with only a few miRNAs, which is not sufficient to evaluate the prediction performance of our method. Therefore, we performed five-fold cross-validation on the 15 diseases associated with more than 90 miRNAs to compare the prediction performance of CNNDMP with that of several state-of-the-art methods. We regard the known miRNA-disease associations as positive samples and the unknown associations as negative samples. The positive samples are randomly divided into five equal parts. Negative samples, equal in number to the positive samples, are selected randomly from all the negative ones and are also divided into five equal parts. In each fold of the cross-validation, four parts of the positive samples and four parts of the negative samples are used as the training data. The remaining positive and negative samples are used as the testing data to verify the prediction performance.
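The sampling scheme above can be sketched as follows. This is a minimal illustration in plain Python, not the authors' code; the function name `balanced_five_fold`, the list-of-lists association matrix and the fixed random seed are assumptions made for the example.

```python
import random

def balanced_five_fold(assoc, n_folds=5, seed=0):
    """Build (train, test) splits from a 0/1 association matrix.

    Positives are the known associations; an equal number of negatives
    is drawn at random from the unobserved pairs (hypothetical helper).
    """
    rng = random.Random(seed)
    positives = [(i, j) for i, row in enumerate(assoc)
                 for j, a in enumerate(row) if a == 1]
    unknown = [(i, j) for i, row in enumerate(assoc)
               for j, a in enumerate(row) if a == 0]
    # Requires at least as many unobserved pairs as known associations.
    negatives = rng.sample(unknown, len(positives))
    rng.shuffle(positives)

    folds = []
    for k in range(n_folds):
        # Every n_folds-th sample, starting at k, forms the k-th part.
        test = positives[k::n_folds] + negatives[k::n_folds]
        train = [pair for f in range(n_folds) if f != k
                 for pair in positives[f::n_folds] + negatives[f::n_folds]]
        folds.append((train, test))
    return folds
```

Each returned pair holds four parts for training and one part for testing, with the same number of positive and negative samples in every part whenever the number of positives is divisible by `n_folds`.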
We obtain the association prediction scores for the testing data from the CNN prediction model and sort them in descending order. If a miRNA-disease pair has a known association and its prediction score is higher than a given threshold δ, it is a successfully identified positive sample. If the prediction score of a negative sample is lower than δ, it is a successfully identified negative sample. By changing the threshold, we can calculate the corresponding true positive rate (TPR), false positive rate (FPR), precision (Precision) and recall rate (Recall). They are defined as follows,
$$\mathrm{TPR} = \frac{TP}{TP + FN}, \quad \mathrm{FPR} = \frac{FP}{TN + FP}, \quad \mathrm{Precision} = \frac{TP}{TP + FP}, \quad \mathrm{Recall} = \frac{TP}{TP + FN}$$
where TP and TN represent the numbers of positive and negative samples correctly identified, FP represents the number of negative samples misidentified as positive samples, and FN represents the number of positive samples misidentified as negative samples. Each time the threshold δ is changed, the corresponding TPR and FPR values, as well as the Precision and Recall values, are obtained. The receiver operating characteristic (ROC) curve and the precision-recall (PR) curve are then drawn using these values. The areas under the ROC curve (ROC-AUC) and the PR curve (PR-AUC) are used to evaluate the overall prediction performance.
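As a rough illustration of the threshold sweep, the sketch below lowers δ one sample at a time and records the resulting (FPR, TPR) and (Recall, Precision) points. The function names are hypothetical, tied scores are assumed not to occur, and the trapezoidal rule is only one of several ways to approximate the area under a curve.

```python
def roc_pr_points(scores, labels):
    """Return ROC points (FPR, TPR) and PR points (Recall, Precision).

    The threshold is lowered past one sample at a time, from the highest
    score to the lowest. Assumes labels are 0/1 and scores have no ties.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    tp = fp = 0
    roc = [(0.0, 0.0)]
    pr = []
    for i in order:
        if labels[i] == 1:
            tp += 1
        else:
            fp += 1
        roc.append((fp / n_neg, tp / n_pos))          # (FPR, TPR)
        pr.append((tp / n_pos, tp / (tp + fp)))       # (Recall, Precision)
    return roc, pr

def trapezoid_area(points):
    """Area under a piecewise-linear curve given as (x, y) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area
```

For example, `trapezoid_area(roc_pr_points(scores, labels)[0])` gives an estimate of the ROC-AUC for one test fold.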
Biologists usually select the top-ranked miRNA candidates from the prediction result to further validate their associations with the disease. Therefore, we calculate the average recall values of the top 30, 60, 90, ..., 210 and 240 candidates for the 15 diseases. The recall shows how many positive samples appear among the top k candidates of each method. The larger the recall value, the more positive samples are identified successfully.
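The top-k recall can be computed directly from the ranked candidate list. The snippet below is a small sketch with a hypothetical function name; it assumes 0/1 labels and at least one positive sample.

```python
def recall_at_k(scores, labels, k):
    """Fraction of all positive samples that appear in the top k candidates."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = sum(label for _, label in ranked[:k])
    return hits / sum(labels)
```

Averaging this value over the 15 diseases for each k gives the quantity compared across methods.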

Comparison with Other Methods
CNNDMP is compared with GSTRW [22], DMPred [19], PBMDA [24] and Liu's Method [21], which are state-of-the-art prediction methods for miRNA-disease associations. The parameters involved in each method need to be adjusted to achieve the best prediction performance. In our method, $w_f$, $w_p$ and $d$ are set to 3, 2 and 11, respectively. Each convolutional layer contains 20 convolution filters, so $n_{conv}$ is set to 20. The restart probability $\beta$ of the random walk is 0.8, and the harmonic parameter $\lambda$ is set to 0.9. We varied $\lambda$ from 0.1 to 0.9, and the corresponding performances of CNNDMP are listed in Table 1. For the other methods, we use the parameters given in the corresponding papers ($\gamma = \theta = 0.2$, $\alpha = \beta = 0.8$, $\lambda = \eta = 0.2$, $w = 0.6$ for GSTRW; $L = 3$, $\alpha = 2.26$ for PBMDA; $\lambda_M = 1/70$, $\lambda_D = 1/10$, $\theta = 1/20$ for DMPred; $\lambda = 0.8$, $\delta = 0.9$, $\eta = 0.1$, $\gamma = 0.5$ for Liu's Method).
CNNDMP achieved the best prediction performance (Figure 1). Its average ROC-AUC is 0.956, which is higher by 15.4%, 3.9%, 11.2%, and 9.1% than those of the other four methods, respectively. The miRNA-disease association scores of GSTRW depend on the calculation of the miRNA similarities and disease similarities, and GSTRW performs the worst among all methods. The performance of PBMDA is similar to that of Liu's Method, as both exploit the network topology information. DMPred utilizes miRNA- and disease-related information and achieves competitive prediction performance. Our method, CNNDMP, integrates the original features of miRNAs and diseases with the network topology, combines them with the powerful representation learning capability of CNNs, and achieves the best prediction performance.
There are far more unobserved miRNA-disease associations than known ones, so there is a serious class imbalance between them. For imbalanced associations, PR curves reflect the prediction performance of different methods better than ROC curves.
Figure 2 shows the PR curves of CNNDMP, GSTRW, DMPred, PBMDA and Liu's Method for the 15 diseases. Their PR-AUCs are 0.538, 0.177, 0.392, 0.324, and 0.334, respectively. The PR-AUC of CNNDMP is thus 36.1%, 14.6%, 21.4%, and 20.4% higher than those of the other methods. As shown in Table 3, CNNDMP yields the best average performance in terms of PR-AUC and achieves the best performance for 14 of the 15 common diseases.
For the top k miRNA candidates, a higher recall rate means that more positive samples are successfully identified.
In addition, to further verify that the ROC-AUC and PR-AUC of CNNDMP are significantly higher than those of the other methods, we performed a paired t-test. All p-values of the paired t-tests are less than 0.05, which indicates that CNNDMP's performance is significantly better than that of the other methods (Table 4).

Comparison between the Individual Networks and the Integrated Network
To verify that the performance of the integrated network is better than the individual networks, we evaluate the prediction performances of the left and right networks within CNNDMP, respectively. The values of ROC-AUC and PR-AUC of the left network are 0.916 and 0.509, respectively. For the right network, the values of ROC-AUC and PR-AUC are 0.905 and 0.494, respectively. Compared with the left and right networks, the ROC-AUC of the integrated network increased by 4% and 5.1%, and the PR-AUC increased by 2.9% and 4.4%.

Case Studies on Breast Cancer, Colorectal Cancer and Lung Cancer
To further demonstrate CNNDMP's ability to discover potential disease-associated miRNAs, we used three independent databases, dbDEMC [31], miRCancer [32], and PhenomiR [33], as well as the relevant literature to verify the candidates of breast cancer, colorectal cancer and lung cancer. We take the prediction results of breast cancer as an example, and list the results of this case analysis in detail.
We list the case study of the top 50 miRNA candidates related to breast cancer in Table 5. dbDEMC is a database of differentially expressed miRNAs in human cancers, and it contains 2224 differentially expressed miRNAs in 36 cancer types. Forty-three of the 50 miRNA candidates are included in this database, which confirmed the differential expression of these candidates in breast cancer. PhenomiR is also a database of differentially expressed miRNAs in human cancers. miRCancer is a miRNA-cancer associations database that collects 6323 miRNA-cancer associations from 4875 academic papers covering 184 cancers. PhenomiR includes two miRNA candidates, and miRCancer contains two candidates. Five miRNA candidates are confirmed in the relevant literature. The top 50 colorectal cancer-related candidates are given in Supplementary Table S1. The databases of dbDEMC and miRCancer respectively include 48 candidates and one candidate whose abnormal expressions have been identified in colorectal cancer. A candidate marked 'Unconfirmed' means that it is not currently supported by the databases and the relevant literature.
For lung cancer, the top 50 candidates are listed in Supplementary Table S2. Forty candidates are included in dbDEMC and three candidates are contained in miRCancer as having abnormal expression in lung cancer. One candidate is reported by PhenomiR to be abnormally regulated in lung cancer. Four candidates are supported by the relevant literature as differentially expressed in lung cancer. Three candidates marked 'Unconfirmed' are not currently supported by the databases or the relevant literature. The case studies on the three diseases confirm that CNNDMP has a powerful ability to discover potential disease miRNAs.

Predicting Novel Disease-Related miRNAs
The comparison of the ROC curves, PR curves and top-k recall rates of the five methods in cross-validation shows that CNNDMP achieves the best prediction performance. The case studies further confirm that CNNDMP performs well in discovering associations between miRNAs and diseases. Therefore, we further apply this method to all 326 diseases. We take all the positive samples and the corresponding negative samples as training data. Finally, the top 100 miRNA candidates for each disease are given in Supplementary Table S3.

Dataset
The miRNA-disease associations used in this study derive from the human miRNA-disease database (HMDD) [39]. HMDD has collected thousands of reliable association pairs between miRNAs and diseases. After integrating different miRNA records and unifying the miRNA and disease names, we finally retained 5088 miRNA-disease associations, involving 490 miRNAs and 326 diseases. Disease terms are available from the National Library of Medicine (http://www.ncbi.nlm.nih.gov/mesh). The phenotypic similarities and the semantic similarities are obtained from a published study [18].

Construction of a miRNA-Disease Heterogeneous Network
miRNA similarity measurement. Based on the biological observation that miRNAs with similar functions tend to be associated with similar diseases, the similarity of two miRNAs is estimated by measuring the similarities of their associated diseases. For example, miRNA $m_a$ is associated with diseases $d_1$, $d_3$, $d_5$, $d_6$, and $d_7$, whereas miRNA $m_b$ is associated with diseases $d_2$, $d_3$, $d_4$, and $d_6$. Wang et al. [40] calculated the similarity between $S_a = \{d_1, d_3, d_5, d_6, d_7\}$ and $S_b = \{d_2, d_3, d_4, d_6\}$ as the similarity of $m_a$ and $m_b$, denoted as $M(m_a, m_b)$. The similarity between $S_a$ and $S_b$ is computed in three steps. First, the similarities between $d_1$ and each of the diseases in $S_b$ are calculated, and the maximum similarity is taken as the similarity between $d_1$ and $S_b$. The similarities between each of $d_3$, $d_5$, $d_6$, $d_7$ and $S_b$ are obtained in the same way. Second, the similarities between each of the diseases in $S_b$ and $S_a$ are calculated. Finally, these similarities are accumulated and divided by the total number of diseases in $S_a$ and $S_b$. We use the matrix $M \in \mathbb{R}^{N_m \times N_m}$ to represent the similarities of miRNAs, where $N_m$ is the number of miRNAs. The values of the miRNA similarities lie between 0 and 1.
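The three steps can be written compactly as below. This is a sketch of the best-match averaging described in the text, not the implementation of Wang et al.; the function names and the representation of diseases as integer indices into a disease similarity matrix `dis_sim` are assumptions for illustration.

```python
def disease_to_set_similarity(disease, disease_set, dis_sim):
    """Similarity between one disease and a set: the maximum pairwise value."""
    return max(dis_sim[disease][other] for other in disease_set)

def mirna_similarity(set_a, set_b, dis_sim):
    """Similarity of two miRNAs from their associated disease sets.

    Steps 1 and 2: best match of every disease in one set against the
    other set. Step 3: accumulate and divide by |S_a| + |S_b|.
    """
    total = sum(disease_to_set_similarity(d, set_b, dis_sim) for d in set_a)
    total += sum(disease_to_set_similarity(d, set_a, dis_sim) for d in set_b)
    return total / (len(set_a) + len(set_b))
```

Because every disease similarity lies in [0, 1], the resulting miRNA similarity also lies in [0, 1], and two miRNAs with identical disease sets obtain a similarity of 1 when each disease has similarity 1 with itself.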
Disease similarity measurement. The disease similarity measures how similar two diseases are from the perspectives of disease semantics and phenotype. The terms related to a disease are represented by a directed acyclic graph (DAG). The more terms the DAGs of two diseases have in common, the more similar the two diseases are. Likewise, two diseases that share more phenotypes are often more similar. Therefore, we quantify the similarity of two diseases based on their semantics and phenotypes. Xuan et al. have successfully integrated this information and calculated the similarities between diseases, so the disease similarities can be obtained from published studies [19,41]. We use the matrix $D \in \mathbb{R}^{N_d \times N_d}$ to represent the similarities between diseases, where $N_d$ is the number of diseases. The values of the similarities lie between 0 and 1.
miRNA-disease associations. If miRNA $m_i$ is associated with disease $d_j$, then $A_{ij} = 1$; otherwise $A_{ij} = 0$, meaning that their association has not been observed. We use $A \in \mathbb{R}^{N_m \times N_d}$ to represent the associations between miRNAs and diseases.
By exploiting the similarities of miRNAs and diseases, as well as the known associations between miRNAs and diseases, we construct a heterogeneous network containing two kinds of nodes (miRNAs and diseases), together with its matrix representation (Figure 4).
Figure 4. (a) The miRNA similarity network, constructed by connecting every two miRNAs whose similarity is greater than 0, and its matrix representation $M$. We represent the miRNA network topology and the similarity values between miRNAs by a weighted network. Each node represents a miRNA, and the weight on an edge is the similarity of the two miRNAs it connects. (b) The disease similarity network and its matrix representation $D$. (c) The miRNA-disease association network, constructed from the known associations between miRNAs and diseases, and its matrix representation $A$. When a disease is associated with a miRNA, they are connected by a dotted line. (d) The miRNA-disease heterogeneous network, which integrates the miRNA similarities, the disease similarities and the miRNA-disease associations.

Prediction Model Based on Dual CNN
We construct a prediction model based on dual CNN, which is composed of left and right parts. The left part learns from the original feature information of miRNAs and diseases. The complex, implicit and nonlinear miRNA-disease feature information is captured by the CNN layer. The right part combines miRNA and disease network topology information and represents it deeply by the CNN layer. Finally, we integrate the results of the left and right to obtain final prediction scores for disease-associated miRNAs.

Embedding Layer
Embedding in the left part by integrating miRNA and disease original feature information. Functionally similar miRNAs are usually involved in similar diseases and vice versa. Therefore, we integrate the miRNA and disease similarities and the miRNA-disease associations to construct the embedding in the left part. We take the miRNA $m_1$ and disease $d_2$ in Figure 5 as an example to elaborate the integration process. The first row of $M$ represents the similarities between $m_1$ and all the miRNAs, and the second row of $A^T$ represents the associations between $d_2$ and all the miRNAs.
The miRNA $m_1$ is similar to $m_2$ and $m_4$, and the disease $d_2$ has known associations with $m_2$, $m_4$ and $m_5$. Thus, miRNA $m_1$ and disease $d_2$ are likely to be associated. Similarly, we integrate the first row of $A$ with the second row of $D$. Here, miRNA $m_1$ is associated with $d_1$, $d_3$ and $d_6$, and disease $d_2$ is similar to $d_1$ and $d_3$, so miRNA $m_1$ and disease $d_2$ are again likely to be associated. The final integration result is represented by the feature matrix $X \in \mathbb{R}^{2 \times (N_m + N_d)}$.
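One way to read this construction is that the first row of the feature matrix concatenates the miRNA's rows of the miRNA similarity and association matrices, and the second row concatenates the disease's rows of the transposed association matrix and the disease similarity matrix. The sketch below follows that reading; the function name and the zero-based indices `i` and `j` are assumptions, and matrices are plain lists of lists.

```python
def left_embedding(M, D, A, i, j):
    """Build the 2 x (N_m + N_d) feature matrix for miRNA i and disease j.

    Row 1: similarities of miRNA i to all miRNAs, then its associations
           with all diseases (row i of M, row i of A).
    Row 2: associations of disease j with all miRNAs, then its similarities
           to all diseases (row j of A^T, row j of D).
    """
    row_mirna = list(M[i]) + list(A[i])
    column_j_of_A = [A[r][j] for r in range(len(A))]   # row j of A^T
    row_disease = column_j_of_A + list(D[j])
    return [row_mirna, row_disease]
```

Stacking the two rows aligns, column by column, what is known about the miRNA with what is known about the disease with respect to the same miRNA or disease, which is what the convolution filters then scan.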


Embedding in the right part by integrating the network topology. We first obtain the network topology information by random walks on the miRNA and disease networks, respectively. The basic principle of a random walk with restart is that the walker starts from a node in the network at time 0 and walks randomly in the miRNA (or disease) network. The more similar the walker's current node is to a neighbor node, the greater the probability that the walker moves to that neighbor. Therefore, after the walking process converges, a greater probability of reaching a certain node indicates that the node is more similar to the starting node. We define the convergent vector as $p^{\infty}$, which represents the similarities between the starting node and all the nodes.
We take the miRNA network as an example to illustrate the computational process in detail. First, we row-normalize the original miRNA similarity matrix $M$ to obtain the probability transition matrix $W$. Then, the network topology-based miRNA similarities are obtained with the following random walk with restart iteration formula,
$$p(t + 1) = (1 - \beta) W^T p(t) + \beta p(0)$$
Taking miRNA $m_1$ as an example, the walk starts from node $m_1$, so the first element of $p(0)$ is set to 1 and the other elements are 0. The parameter $\beta \in (0, 1)$ represents the probability that the walker returns to the starting node $m_1$ and restarts the walk. $W^T$ is the transpose of $W$, $p(t)$ represents the probability that the walker arrives at each miRNA node at time $t$, and $p(t + 1)$ represents the arrival probability at time $t + 1$. The convergence condition is satisfied when the $L_1$ norm of the difference between $p(t + 1)$ and $p(t)$ is less than $10^{-6}$. After the walking process has converged, the vector $p^{\infty}_{m_1}$ is obtained and regarded as a part of the embedding in the right part. Similarly, in the disease network, we walk randomly from the disease node $d_2$ and finally obtain the vector $p^{\infty}_{d_2}$ as a part of the right embedding.
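A minimal sketch of the random walk with restart is given below. It assumes the standard update p(t+1) = (1 − β) Wᵀ p(t) + β p(0), with β as the restart probability and convergence declared when the L1 change drops below 10⁻⁶; the function name, the list-of-lists matrix and the iteration cap `max_iter` are illustrative choices rather than details from the paper.

```python
def random_walk_with_restart(sim, start, beta=0.8, tol=1e-6, max_iter=1000):
    """Return the converged visiting probabilities for a walk from `start`."""
    n = len(sim)
    # Row-normalise the similarity matrix into the transition matrix W.
    W = []
    for row in sim:
        total = sum(row)
        W.append([x / total if total > 0 else 0.0 for x in row])

    p0 = [1.0 if k == start else 0.0 for k in range(n)]
    p = p0[:]
    for _ in range(max_iter):
        # (W^T p)[k] = sum over r of W[r][k] * p[r]
        walked = [sum(W[r][k] * p[r] for r in range(n)) for k in range(n)]
        p_next = [(1 - beta) * walked[k] + beta * p0[k] for k in range(n)]
        # L1 norm of the change between two consecutive iterations.
        if sum(abs(a - b) for a, b in zip(p_next, p)) < tol:
            return p_next
        p = p_next
    return p
```

Calling the function once per miRNA (and once per disease on the disease similarity matrix) yields the topology-based similarity vectors that form the right embedding.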
where $H \in \mathbb{R}^{2 \times v}$ is a weight matrix between the fully connected layer and the output layer.
Figure 7. miRNA-disease association prediction framework based on dual CNN.


Convolutional Module on the Right
The embedding $Y \in \mathbb{R}^{2 \times (N_m + N_d)}$ in the right part is input to learn the feature representation of the network topology (Figure 7). The convolution and pooling processes of the right part are similar to those in the left part. The convolutional operation $Z_2$ and the max-pooling operation $U_1$ are defined as follows, where $Z_2$ is the feature output of the convolution operation and $Y_{i1}$ is the first column vector in the sliding window when the filter moves to the $i$th position of $Y$. $U_1$ is obtained by performing the max-pooling operation on $Z_2$. We use $U_1$ as the input of the second convolutional layer, and obtain the output of the second pooling layer $U_2 \in \mathbb{R}^{2 \times \frac{1}{4}(N_m + N_d) \times 2n_{conv}}$. Similarly, taking $U_2$ as the input of the third convolutional layer yields $U_3 \in \mathbb{R}^{2 \times \frac{1}{8}(N_m + N_d) \times 3n_{conv}}$. Finally, we flatten $U_3$ to the column vector $p \in \mathbb{R}^{v \times 1}$, $v = 2 \times \frac{1}{8}(N_m + N_d) \times 3n_{conv}$, and get the association score between $m_1$ and $d_2$ by the fully connected layer. The score is defined as $score_2 \in \mathbb{R}^{2 \times 1}$, where $K \in \mathbb{R}^{2 \times v}$ is the weight matrix between the fully connected layer and the output layer.
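To make the stated dimensions concrete, the following sketch tracks the shape of the pooled output after each of the three layers. It assumes that every convolution is padded so that the width is preserved, that every pooling step halves the width, and that the first pooled output has $n_{conv}$ channels; these assumptions are chosen to agree with the shapes given for $U_2$ and $U_3$ and are not spelled out in the text. The function name is hypothetical.

```python
def right_branch_shapes(n_mirna, n_disease, n_conv=20, n_layers=3):
    """Return the (rows, width, channels) shape after each pooling layer
    and the length v of the flattened vector fed to the dense layer."""
    width = n_mirna + n_disease
    shapes = []
    for layer in range(1, n_layers + 1):
        width //= 2                       # assumed: pooling halves the width
        shapes.append((2, width, layer * n_conv))
    rows, cols, channels = shapes[-1]
    return shapes, rows * cols * channels

# With the dataset sizes reported in the paper (490 miRNAs, 326 diseases):
shapes, v = right_branch_shapes(490, 326)
print(shapes, v)
```

Under these assumptions the 816 input columns shrink to 408, 204 and 102, so the flattened vector has 2 × 102 × 60 entries.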

Combined Strategy
The association scores $score_1$ and $score_2$ are obtained from different perspectives of miRNA-disease information. To take full advantage of the prediction results from the left and right parts, we integrate the two scores as the final association score between a miRNA and a disease. It is defined as follows,
$$score = \lambda \times score_1 + (1 - \lambda) \times score_2 \tag{12}$$
where the parameter $\lambda \in (0, 1)$ is used to adjust the importance of $score_1$ and $score_2$. The loss functions of the left and right CNNs are denoted as $loss_1$ and $loss_2$, with
$$a = \frac{e^{score_1(2)}}{\sum_{j=1}^{2} e^{score_1(j)}}, \qquad b = \frac{e^{score_2(2)}}{\sum_{j=1}^{2} e^{score_2(j)}} \tag{16}$$
where $y_{label}$ indicates the actual association between a miRNA and a disease. $y_{label}$ is 1 when the miRNA is associated with the disease; otherwise $y_{label}$ is 0. $score_1(1)$ and $score_1(2)$ represent the scores of a miRNA-disease association being classified as a negative sample and as a positive one, respectively. $a$ and $b$ indicate the corresponding probabilities obtained by the softmax function. $T$ represents the number of training samples.
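The score combination and the softmax probabilities can be illustrated as follows. The weighted sum and the softmax follow the definitions above; the cross-entropy form of the loss, averaged over the training samples, is an assumption, since the exact expressions of the two loss functions are not legible here. All function names are hypothetical.

```python
import math

def softmax_positive(score):
    """Probability of the positive class for a score pair [negative, positive]."""
    shift = max(score)                       # subtract the max for stability
    exps = [math.exp(s - shift) for s in score]
    return exps[1] / sum(exps)

def combined_score(score1, score2, lam=0.9):
    """Final score: lam * score1 + (1 - lam) * score2, element by element."""
    return [lam * s1 + (1 - lam) * s2 for s1, s2 in zip(score1, score2)]

def cross_entropy(probs, labels):
    """Assumed loss: mean binary cross-entropy over the training samples."""
    total = sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for p, y in zip(probs, labels))
    return -total / len(labels)
```

With `lam=0.9`, the value reported for λ in the comparison experiments, the left branch dominates the final score while the topology-based right branch acts as a correction.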

Conclusions
A novel method based on a dual convolutional neural network, CNNDMP, is developed for prioritizing potential disease miRNAs. CNNDMP's embedding layer is constructed from the biological perspective by combining the biological premise about miRNA-disease associations. At the same time, the embedding layer captures the original similarities and associations of miRNAs and diseases, as well as the topology structure of the miRNA and disease networks. The new framework based on a dual convolutional neural network is constructed for learning the deep features of the original similarities and associations of miRNAs and diseases, and of the new miRNA and disease similarities. The results of cross-validation on 15 common diseases confirm CNNDMP's superior performance. The case studies on three diseases further show that CNNDMP has a strong ability to discover candidate disease miRNAs.
Author Contributions: P.X. and T.Z. conceived the prediction method, and P.X. wrote the paper. Y.D. and Y.L. developed the computer programs. Y.G. and T.Z. analyzed the results and revised the paper.

Conflicts of Interest:
The authors declare no conflict of interest.