GATCDA: Predicting circRNA-Disease Associations Based on Graph Attention Network

Simple Summary
CircRNAs (circular RNAs), a novel kind of non-coding RNA, play a regulatory role in cellular processes. A growing number of biological experiments have shown that circRNAs can be used as biomarkers and therapeutic targets for some cancers. As the time and financial costs of biological experiments are high, computational methods have become an efficient way to predict the associations between circRNAs and diseases. In this study, a graph attention network was applied for the first time to predict circRNA-disease associations using multiple similarity measures. The circRNA-miRNA interactions and disease-mRNA interactions were adopted to construct features. The computational method proposed in this study improved the prediction performance.

Abstract
CircRNAs (circular RNAs) are a class of non-coding RNA molecules with a closed circular structure. CircRNAs are closely related to the occurrence and development of diseases. Due to the time-consuming nature of biological experiments, computational methods have become an efficient way to predict the interactions between circRNAs and diseases. In this study, we developed a novel computational method called GATCDA, which uses a graph attention network (GAT) to predict circRNA-disease associations with disease symptom similarity, network similarity, and information entropy similarity for both circRNAs and diseases. GAT learns representations for nodes on a graph by an attention mechanism, which assigns different weights to different nodes in a neighborhood. Considering that the circRNA-miRNA-mRNA axis plays an important role in the generation and development of diseases, circRNA-miRNA interactions and disease-mRNA interactions were adopted to construct features, in which mRNAs were related to 88% of miRNAs. As demonstrated by five-fold cross-validation, GATCDA yielded an AUC value of 0.9011. In addition, case studies showed that GATCDA can predict unknown circRNA-disease associations. In conclusion, GATCDA is a useful method for exploring associations between circRNAs and diseases.


Introduction
CircRNAs (circular RNAs) are a class of non-coding RNA molecules with a closed circular structure, without a 5′-end cap or a 3′-end poly(A) tail. They are mainly located in the cytoplasm or stored in exosomes, and are not affected by RNA exonucleases [1]. Although circRNAs are non-coding RNAs, some circRNAs can encode polypeptides. Currently, the well-recognized biological functions of circRNAs are as follows [2]: miRNA sponging, regulatory protein binding, regulation of gene transcription, and coding functions. CircRNA expression is more stable, circRNAs are not easily degraded, and they have been shown to exist widely in a variety of eukaryotes [1]. Most circRNAs are formed by exon loops, and some circRNAs are lariat structures formed by intron loops. Because circRNAs contain a number of miRNA response elements (MREs), they can form the catalytic core of the RNA-induced silencing complex (RISC) with AGO proteins, which eventually leads to the degradation of circRNAs [3]. According to their sources, circRNAs can be roughly divided into four categories [4]: full-exon circRNAs, exon-intron circRNAs (EIcircRNAs), intron-composed lariat circRNAs, and circRNAs produced by cyclization of viral RNA genomes (tRNA, rRNA, snRNA, etc.). Twenty years ago, scientists found circRNAs in plant viroids, yeast mitochondria, and hepatitis B viruses (HBV), and regarded them as byproducts of abnormal splicing with no regulatory function [5]. In 2013, Hansen et al. proposed and confirmed for the first time that circRNAs function as miRNA sponges [6], opening a new field for circRNA research. With the rapid development of RNA sequencing technology and bioinformatics analysis, 14,807 candidate circRNAs have been identified in the human tissue transcriptome, and many exons have been found to form circRNAs by nonlinear reverse splicing or gene rearrangement in cells of other species.
In recent years, many studies [7][8][9] have shown that circRNAs are closely related to the occurrence and development of diseases, and have pointed to their potential as diagnostic markers. For example, after brain or spinal cord injury, circRNAs can activate several biological, molecular, and cellular activities. Therefore, interventions centered on the regulation of circRNAs may be promising for traumatic brain injury and spinal cord injury [10]. Chen et al. demonstrated that circRNA circCTNNA1 promoted colorectal cancer progression by sponging miR-149-5p and regulating FOXM1 expression [11]. Wang et al. found that circCNST promoted the tumorigenesis of osteosarcoma cells by sponging miR-421 and targeting SLC25A3, providing a potential biomarker for patients with osteosarcoma [12]. Wu et al. discovered that circ_0009582, circ_0037120, and circ_0140117 may serve as potential biomarkers for predicting the occurrence of hepatocellular carcinoma in patients with HBV infection [13]. Li et al. revealed that circRNA circITGA7 may play a regulatory role in thyroid cancer and may be a potential marker for thyroid cancer diagnosis or progression [14].
At present, the number of known interactions between circRNAs and diseases obtained through biological experiments is increasing. Several relevant databases have appeared [15][16][17][18] that collect the interactions between circRNAs and diseases verified by biological experiments. Due to the significant time and financial costs associated with biological experiments, predicting associations between circRNAs and diseases using computational methods built on these databases has become a hot topic in recent years. The current computational methods can be divided into five categories [19]. The first category is network propagation methods. For example, the computational model BRWSP applies a biased random walk to search paths on a multiple heterogeneous network to discover circRNA-disease associations [20]. A method called KATZHCDA uses KATZ measures for human circRNA-disease association prediction [21]. The second category is recommendation system methods. For example, Lei et al. proposed a computational method named ICFCDA based on a collaborative filtering recommendation system, which handles the "cold start" problem to predict potential circRNA-disease associations [22]. The third category is matrix completion methods. For example, a computational method called iCircDA-MF was developed based on matrix factorization by Wei et al. [23]. Zhang et al. [24] utilized metapath2vec++ and matrix factorization to discover circRNA-disease associations. The fourth category is classical machine learning methods. For example, based on a gradient boosting decision tree, a model named GBDTCDA was proposed by Lei et al. in 2019 [25]. A computational model called RWLR uses logistic regression to predict circRNA-disease associations [26]. The fifth category is deep learning methods. For example, Wang's method [27] applies a convolutional neural network to discover unknown circRNA-disease associations. In 2020, GCNCDA was proposed based on a graph convolutional network [28].
In this paper, we propose a novel computational model named GATCDA to predict circRNA-disease associations with a graph attention network (GAT). First, we construct a circRNA-disease association network, a circRNA-miRNA association network, and a disease-mRNA association network. Second, we calculate disease symptom similarity, network similarity, and information entropy similarity for both circRNAs and diseases. Third, these similarities are integrated to create the features of circRNAs and diseases. Fourth, the circRNA-disease association network and the features of circRNAs and diseases are fed into GAT, and the output is the prediction score of associations between circRNAs and diseases.
An initial disease-mRNA association dataset was downloaded from DisGeNET [30], including more than 60,000 disease-mRNA interactions. We selected 37 diseases common to the circRNA-disease interactions and the disease-mRNA interactions. There were 820 mRNAs related to these 37 diseases in the initial disease-mRNA interactions. Finally, we constructed a new disease-mRNA association network including 1239 disease-mRNA associations among 37 diseases and 820 mRNAs.
CircRNAs act as miRNA sponges in cells and increase the expression level of target genes. The circRNA-miRNA-mRNA axis plays an important regulatory role in diseases [2,31,32]. In the new circRNA-miRNA associations, we identified all the miRNAs related to diseases in the circRNA-disease associations or to mRNAs in the disease-mRNA associations. Furthermore, in the new disease-mRNA associations, we identified all the mRNAs related to 125 of the 142 miRNAs mentioned above.

Construction of the Interaction Network
For convenience, we formulate circRNA-disease associations as a binary matrix $Y \in \mathbb{R}^{624 \times 102}$. If there exists an experimentally verified interaction between circRNA $c_i$ and disease $d_j$, $Y(i, j) = 1$; otherwise, $Y(i, j) = 0$. At the same time, a circRNA-miRNA interaction matrix and a disease-mRNA interaction matrix are constructed in the same way based on circRNA-miRNA associations and disease-mRNA associations, respectively.
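The construction of such a binary matrix can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `build_association_matrix` and the toy identifiers are hypothetical, and NumPy is assumed to be available.

```python
import numpy as np

def build_association_matrix(pairs, row_ids, col_ids):
    """Return a binary matrix Y with Y[i, j] = 1 for every known (row, col) pair."""
    row_index = {name: i for i, name in enumerate(row_ids)}
    col_index = {name: j for j, name in enumerate(col_ids)}
    Y = np.zeros((len(row_ids), len(col_ids)), dtype=np.int8)
    for r, c in pairs:
        Y[row_index[r], col_index[c]] = 1
    return Y

# Toy data with hypothetical identifiers (not taken from the paper's dataset).
circrnas = ["circ_A", "circ_B", "circ_C"]
diseases = ["disease_X", "disease_Y"]
known_pairs = [
    ("circ_A", "disease_X"),
    ("circ_C", "disease_Y"),
    ("circ_A", "disease_Y"),
]

Y = build_association_matrix(known_pairs, circrnas, diseases)
print(Y)
```

The same helper can be reused for the circRNA-miRNA and disease-mRNA matrices by passing the corresponding pair lists and identifier lists.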

Network Similarity
Zhou et al. [33] demonstrated the usefulness of network similarity. For a given miRNA $mi_k$, we denote the set of its interacting circRNAs by $C(mi_k)$. For a given mRNA $m_k$, we denote the set of its interacting diseases by $D(m_k)$. We write $nc(mi_k)$ for the network contribution of miRNA $mi_k$ in the circRNA-miRNA interaction network and $y$ for the number of miRNAs. Likewise, we write $nc(m_k)$ for the network contribution of mRNA $m_k$ in the disease-mRNA interaction network and $z$ for the number of mRNAs. We also denote the set of miRNAs that interact with a given circRNA $c_u$ by $Mi(c_u)$, and the set of mRNAs that interact with a given disease $d_u$ by $M(d_u)$. The network similarity between circRNA $c_u$ and circRNA $c_v$ is denoted by $CNS(c_u, c_v)$. Similarly, given two diseases $d_u$ and $d_v$, the network similarity between them is denoted by $DNS(d_u, d_v)$.

Information Entropy Similarity
Information entropy is also used to measure the topological similarity of circRNAs and diseases. For a given circRNA $c_u$, we denote the set of its interacting diseases by $T_m^{c_u}$. For a given circRNA $c_v$, we denote the set of its interacting diseases by $T_m^{c_v}$. The information entropy of $T_m^{c_u}$ is calculated from the following quantities: $nd$ is the number of diseases related to circRNA $c_u$, $N_{cd}$ is the total number of known circRNA-disease interactions, $n(T_m^{c_u}(i))$ is the number of interactions between the $i$th disease in the related disease set of circRNA $c_u$ and all circRNAs, and $p(T_m^{c_u}(i))$ is the proportion of the known circRNA-disease interactions that involve the $i$th disease in the related disease set of circRNA $c_u$. The information entropy similarity between circRNA $c_u$ and circRNA $c_v$ is then calculated from these entropies. Similarly, the information entropy similarity between disease $d_u$ and disease $d_v$ can be calculated, where $T_n^{d_u}$ is the set of circRNAs interacting with disease $d_u$, $T_n^{d_v}$ is the set of circRNAs interacting with disease $d_v$, $H(T_n^{d_u} \cap T_n^{d_v})$ is the information entropy of the intersection of $T_n^{d_u}$ and $T_n^{d_v}$, and $DES(d_u, d_v)$ is the information entropy similarity of disease $d_u$ and disease $d_v$.
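To make the quantities above concrete, the sketch below computes an entropy value for a circRNA's disease set. It assumes the Shannon-style form $H = -\sum_i p_i \log_2 p_i$ with $p_i = n_i / N_{cd}$; both the exact expression and the base of the logarithm are assumptions, since the formula is not reproduced in the text. The helper `set_entropy` and the toy data are hypothetical, and the similarity score itself is deliberately not computed.

```python
import math

def set_entropy(disease_set, degree, total_interactions):
    """Entropy of a disease set, assuming H = -sum(p * log2(p)) with
    p = degree[d] / total_interactions (assumed form, see lead-in)."""
    h = 0.0
    for d in disease_set:
        p = degree[d] / total_interactions
        if p > 0:
            h -= p * math.log2(p)
    return h

# Toy circRNA -> disease sets with hypothetical identifiers.
associations = {
    "circ_A": {"d1", "d2"},
    "circ_B": {"d1"},
    "circ_C": {"d2", "d3"},
}

# N_cd: total number of known circRNA-disease interactions.
N_cd = sum(len(ds) for ds in associations.values())

# degree[d]: number of circRNAs interacting with disease d.
degree = {}
for ds in associations.values():
    for d in ds:
        degree[d] = degree.get(d, 0) + 1

H_A = set_entropy(associations["circ_A"], degree, N_cd)
H_shared = set_entropy(associations["circ_A"] & associations["circ_C"], degree, N_cd)
print(H_A, H_shared)
```

The entropy of the intersection (`H_shared`) is the kind of term that the similarity definition refers to; how it is normalized into a similarity is left open here.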

Disease Symptom Similarity
According to the co-occurrence of diseases and symptom terms recorded in the PubMed bibliography, and the work of Zhou et al. [34], the disease similarity can be measured and a symptom-based human disease network can be constructed. Here, the symptom-based disease similarity matrix DSS was obtained from the symptom profiles of diseases.

Integration of Similarities
The integrated circRNA similarities and integrated disease similarities are regarded as circRNA features and disease features, respectively. The integrated circRNA similarity $ICS(c_u, c_v)$ between circRNA $c_u$ and circRNA $c_v$ combines the circRNA network similarity $CNS(c_u, c_v)$ and the circRNA information entropy similarity $CES(c_u, c_v)$, with $\beta$ as an adjusting parameter. The dimensions of the ICS matrix are 624 × 624.
The integrated disease similarities are calculated analogously from the disease similarities, with $\alpha$ as the adjusting parameter.

Graph Attention Network
GAT [35] combines a weighted sum of the adjacent node features with the attention mechanism. The weight of the adjacent node features is completely dependent on the node features and independent of graph structure. GAT aims to construct a hidden self-attention layer and to learn representations for nodes on a graph by assigning different weights to different nodes in a neighborhood.
The input of the graph attention layer is

$$f = \{f_1, f_2, \ldots, f_N\}, \quad f_i \in \mathbb{R}^{F}$$

where $N$ is the number of nodes (all circRNAs and all diseases), $F$ is the length of the features, and the matrix $f \in \mathbb{R}^{N \times F}$ denotes the features of all nodes. The output of the graph attention layer is

$$f' = \{f'_1, f'_2, \ldots, f'_N\}, \quad f'_i \in \mathbb{R}^{F'}$$

where $F'$ denotes the dimension of the new features and the matrix $f' \in \mathbb{R}^{N \times F'}$ denotes the new features of all nodes. The first step is to learn the importance of the neighbors for a given node. GAT implements the self-attention mechanism for every node. The attention coefficient $e_{ij}$ for an association pair between circRNA $c_i$ and disease $d_j$ is formulated as follows

$$e_{ij} = att(W f_i, W f_j) \tag{12}$$

where $att$ denotes a single-layer feed-forward neural network that transforms input features into high-level features for circRNAs and diseases, and $W \in \mathbb{R}^{F' \times F}$ is a weight matrix.
To make the attention coefficient comparable across different nodes, GAT further normalizes the attention coefficient $e_{ij}$ as follows

$$\theta_{ij} = \mathrm{softmax}_j(e_{ij}) = \frac{\exp(e_{ij})}{\sum_{k \in N_i} \exp(e_{ik})} \tag{13}$$

where $N_i$ is the set of neighbor nodes of circRNA $c_i$, and $\theta_{ij}$ is the normalized attention coefficient indicating the importance of disease $d_j$ for circRNA $c_i$ in the process of information propagation. By combining Formulas (12) and (13), the complete attention mechanism can be obtained as follows

$$\theta_{ij} = \frac{\exp\left(\mathrm{LeakyReLU}\left(a^{T}[W f_i \,\|\, W f_j]\right)\right)}{\sum_{k \in N_i} \exp\left(\mathrm{LeakyReLU}\left(a^{T}[W f_i \,\|\, W f_k]\right)\right)} \tag{14}$$

where LeakyReLU is a nonlinear activation function that assigns all negative values a nonzero slope, $T$ denotes transposition, $\|$ is the concatenation operation, and $a \in \mathbb{R}^{2F'}$ is the weight vector of the graph attention layer. The second step is to fuse the representations of the neighbors for a given node according to their attention coefficients. The embedding of a given node is obtained by fusing the projected features of its neighbors with different weights as follows

$$f'_i = \sigma\left(\sum_{j \in N_i} \theta_{ij} W f_j\right) \tag{15}$$

where $\sigma$ is a nonlinear activation function.
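The two steps described above (scoring neighbors with a LeakyReLU over the concatenated projected features, normalizing over each neighborhood, and then aggregating) can be sketched in NumPy. This is an illustrative single-head layer under stated assumptions, not the authors' implementation: self-loops are added so every node has at least one neighbor, $\sigma$ is taken to be `tanh`, the LeakyReLU slope of 0.2 is a conventional choice, and all weights are random.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    """LeakyReLU: identity for positive values, small slope for negative ones."""
    return np.where(x > 0, x, slope * x)

def gat_layer(f, adj, W, a, slope=0.2):
    """Single-head graph attention layer (illustrative sketch).

    f:   (N, F) node features
    adj: (N, N) boolean adjacency, adj[i, j] is True if j is a neighbor of i
    W:   (F_out, F) weight matrix
    a:   (2 * F_out,) attention weight vector
    Returns the new features (N, F_out) and the attention matrix theta (N, N).
    """
    N = f.shape[0]
    adj = adj | np.eye(N, dtype=bool)      # assumption: include self-loops
    h = f @ W.T                            # row i holds W f_i
    F_out = h.shape[1]
    # a^T [W f_i || W f_j] splits into a_left . W f_i + a_right . W f_j
    left = h @ a[:F_out]
    right = h @ a[F_out:]
    e = leaky_relu(left[:, None] + right[None, :], slope)
    e = np.where(adj, e, -np.inf)          # only neighbors take part in the softmax
    e = e - e.max(axis=1, keepdims=True)   # numerical stability
    w = np.exp(e)
    theta = w / w.sum(axis=1, keepdims=True)
    f_new = np.tanh(theta @ h)             # sigma assumed to be tanh here
    return f_new, theta

# Toy bipartite graph: nodes 0-1 are circRNAs, nodes 2-3 are diseases.
Y = np.array([[1, 0], [1, 1]], dtype=bool)
N = 4
adj = np.zeros((N, N), dtype=bool)
adj[:2, 2:] = Y
adj[2:, :2] = Y.T

rng = np.random.default_rng(0)
F_in, F_out = 5, 3
f = rng.normal(size=(N, F_in))
W = rng.normal(size=(F_out, F_in))
a = rng.normal(size=2 * F_out)

f_new, theta = gat_layer(f, adj, W, a)
print(theta.round(3))
```

Each row of `theta` sums to one and is zero outside the node's neighborhood, which is what makes the coefficients comparable across nodes.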
GAT applies a multi-head attention mechanism to increase the stability of the learning process of self-attention. Multi-head attention is the combination of multiple self-attention structures. Each head learns the features in a different representation space, and the focus of attention learned by different heads may differ slightly, which increases the capacity of the model. Specifically, $K$ independent attention mechanisms are integrated to obtain the embedding, where $K$ is the number of attention mechanisms and $W^k$ is the weight matrix for the $k$th attention mechanism. Ultimately, the probability score matrix $S$ is calculated from $U \in \mathbb{R}^{nc \times F'}$, the final representation matrix of the circRNAs, in which $nc$ is the number of circRNAs, and $V \in \mathbb{R}^{nd \times F'}$, the final representation matrix of the diseases, in which $nd$ is the number of diseases. The dimension of the probability score matrix $S$ is $nc \times nd$.
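As a rough sketch of how per-head embeddings could be combined and turned into an $nc \times nd$ score matrix, the example below concatenates the head outputs and scores every circRNA-disease pair with a sigmoid of the inner product of their representations. Both choices (concatenation rather than averaging, and the sigmoid inner product) are assumptions made for illustration; the text here does not confirm either. The random arrays stand in for real head outputs.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def combine_heads(head_outputs):
    """Concatenate K per-head embeddings of shape (N, F') into (N, K * F')."""
    return np.concatenate(head_outputs, axis=1)

def score_matrix(embeddings, n_circ):
    """Split node embeddings into circRNA rows U and disease rows V,
    then score every pair (assumed form: sigmoid of the inner product)."""
    U = embeddings[:n_circ]
    V = embeddings[n_circ:]
    return sigmoid(U @ V.T)

rng = np.random.default_rng(0)
n_circ, n_dis, K, F_out = 3, 2, 4, 5

# Stand-ins for the outputs of K attention heads over all nodes.
heads = [rng.normal(scale=0.5, size=(n_circ + n_dis, F_out)) for _ in range(K)]

embeddings = combine_heads(heads)
S = score_matrix(embeddings, n_circ)
print(S.shape)
```

Averaging the heads instead of concatenating them would keep the embedding dimension at $F'$; which variant applies depends on the layer configuration.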
The detailed procedure of using GAT to predict the associations between circRNAs and diseases is shown in Figure 2. As shown in Figure 2, the circRNA-disease association network is fed into a GAT in which the final node representation is obtained through feature propagation and attention fusion. Finally, the prediction score is calculated according to the node representation. In the case of disease $d_3$ and circRNA $c_2$, the dark blue row represents $d_3$, the dark yellow row represents $c_2$, and the red grid represents the predicted score of the association between $d_3$ and $c_2$.

Performance Evaluation
The five-fold cross-validation (5CV) technique was used to evaluate the prediction performance of our model. The 5CV technique randomly divides the positive samples into five equal parts; in each round, one part is held out as the testing set while the remaining samples are used as training samples. Next, the predicted scores are sorted in descending order. We drew the receiver operating characteristic (ROC) curve by plotting the true positive rate (TPR) versus the false positive rate (FPR) at different score thresholds. TPR is the percentage of positive cases that are correctly identified, and FPR is the percentage of negative cases that are incorrectly identified as positive. Generally, the area under the ROC curve (AUC) is calculated and employed to evaluate the prediction performance. Specifically, the closer the AUC value is to one, the better the prediction performance. In 5CV, GATCDA achieved an AUC of 0.9011. In addition, GATCDA yielded an accuracy of 0.8710, with a precision of 0.9013.
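The evaluation protocol can be sketched in plain Python. The example below splits positive pairs into five folds and computes AUC as the probability that a randomly chosen positive is scored above a randomly chosen negative, which is equivalent to the area under the ROC curve. The function names and toy data are hypothetical, and the handling of negative samples in the actual experiments is not specified here.

```python
import random

def five_fold_splits(positive_pairs, n_folds=5, seed=0):
    """Yield (train, test) lists; each fold of positives is held out once."""
    pairs = list(positive_pairs)
    random.Random(seed).shuffle(pairs)
    folds = [pairs[k::n_folds] for k in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        train = [p for j, fold in enumerate(folds) if j != k for p in fold]
        yield train, test

def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly;
    ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy positive pairs (circRNA index, disease index), hypothetical.
pairs = [(i, i % 3) for i in range(10)]
splits = list(five_fold_splits(pairs))

# Toy labels and scores to illustrate the AUC computation.
auc = roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.1])
print(len(splits), auc)
```

In the toy example, three of the four positive-negative pairs are ordered correctly, so the AUC is 0.75.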

Adjustment of Parameters
The GATCDA model involves two parameters, α and β, which adjust the influence of the individual similarity measures when calculating the integrated similarities. We let α and β both range between 0.1 and 0.9. As a result, GATCDA with α = 0.1 and β = 0.1 achieved the highest AUC of 0.9011 in 5CV, as shown in Figure 3.
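A parameter sweep of this kind can be sketched as a simple grid search. The step size of 0.1 is an assumption (the text only gives the range), and `toy_evaluate` is a hypothetical stand-in for "train the model with (α, β) and return the 5CV AUC"; it is not a real result.

```python
def grid_search(evaluate, values):
    """Return (best_score, alpha, beta) over all (alpha, beta) combinations."""
    best = None
    for alpha in values:
        for beta in values:
            score = evaluate(alpha, beta)
            if best is None or score > best[0]:
                best = (score, alpha, beta)
    return best

# Candidate values 0.1, 0.2, ..., 0.9 (step size assumed).
values = [round(0.1 * k, 1) for k in range(1, 10)]

def toy_evaluate(alpha, beta):
    # Hypothetical placeholder for a full train-and-evaluate run.
    return 1.0 - (alpha - 0.1) ** 2 - (beta - 0.1) ** 2

best_score, best_alpha, best_beta = grid_search(toy_evaluate, values)
print(best_alpha, best_beta)
```

In practice each call to the evaluation function would involve a complete cross-validation run, so the 81 combinations dominate the cost of tuning.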

Compared with Other Methods
To analyze the performance of GATCDA in predicting circRNA-disease associations, we compared GATCDA with four other methods: DWNN-RLS [36], KATZHCDA [21], bi-random walks (BiRWR) [37], and DeepWalk [38]. DWNN-RLS uses the regularized least squares of the Kronecker product kernel to predict circRNA-disease associations. KATZHCDA uses KATZ measures for human circRNA-disease association prediction. BiRWR predicts circRNA-disease associations by walking in a circRNA subnetwork and a disease subnetwork. DeepWalk learns latent representations of nodes in a graph structure. The ROC curve and AUC value of each method using 5CV are shown in Figure 4. The precision-recall curve and the area under the precision-recall curve (AUPR) of each method with 5CV are shown in Figure 5. The comparison shows that extracting circRNA and disease features with GAT achieves better prediction performance than DeepWalk. In addition, as a deep learning method, GAT also shows better prediction performance than the two link-based prediction methods (KATZHCDA and BiRWR).

Case Study
To further evaluate the prediction performance of GATCDA, we also carried out case studies on three common diseases, i.e., bladder cancer, diabetes retinopathy, and rheumatoid arthritis.
Bladder cancer is the most frequent cancer affecting the urinary tract [39], and has a high rate of recurrence [40]. Diabetes retinopathy is a common chronic metabolic disorder, increasing with an ageing population and the growing number of cases of diabetes [41]. Rheumatoid arthritis is the most common chronic inflammatory arthritis, which can lead to cartilage and bone damage and disability [42]. There is increasing evidence that circRNAs can be used as effective biomarkers for the diagnosis of bladder cancer, diabetes retinopathy, and rheumatoid arthritis. Therefore, we selected bladder cancer, diabetes retinopathy, and rheumatoid arthritis to verify the predictive ability of GATCDA.
In this work, all known associations between the investigated disease and circRNAs were assumed to be unknown. Through the calculation of GATCDA, the circRNAs with the top 10 scores were selected among all the predicted associations between the investigated disease and circRNAs. Then, through searching the related literature, some circRNAs were confirmed to be related to the investigated disease.
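The case-study procedure (hide all known associations of the investigated disease, score all circRNAs, keep the highest-ranked candidates) can be sketched as follows. The score matrix is a hypothetical placeholder for the model's output after retraining on the masked matrix, and the helper names are illustrative only.

```python
import numpy as np

def mask_disease(Y, disease_idx):
    """Return a copy of Y in which all associations of one disease are set to 0."""
    Y_masked = Y.copy()
    Y_masked[:, disease_idx] = 0
    return Y_masked

def top_k_candidates(scores, disease_idx, circrna_ids, k=10):
    """Return the k circRNAs with the highest predicted scores for one disease."""
    column = scores[:, disease_idx]
    order = np.argsort(-column, kind="stable")[:k]
    return [(circrna_ids[i], float(column[i])) for i in order]

# Toy association matrix: 4 circRNAs x 2 diseases (hypothetical).
Y = np.array([[1, 0], [1, 1], [0, 1], [0, 0]])
circrna_ids = ["circ_A", "circ_B", "circ_C", "circ_D"]

Y_masked = mask_disease(Y, disease_idx=0)

# Placeholder for the scores a model would predict from Y_masked.
scores = np.array([[0.2, 0.9], [0.8, 0.1], [0.5, 0.4], [0.7, 0.3]])

top2 = top_k_candidates(scores, disease_idx=0, circrna_ids=circrna_ids, k=2)
print(top2)
```

The returned candidates are then checked against the literature, as described above.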
The results of the case studies of the three diseases (bladder cancer, diabetes retinopathy, and rheumatoid arthritis) are shown in Table 1. For bladder cancer, 8 of the top 10 candidates with the highest prediction scores are confirmed by the relevant literature. Notably, the seventh circRNA (hsa_circ_0075828) predicted by GATCDA is related to bladder cancer. For diabetes retinopathy and rheumatoid arthritis, 7 of the top 10 candidates with the highest prediction scores are confirmed by the relevant literature. For example, Li et al. found that hsa_circ_0001859 regulates ATF2 expression by functioning as a miR-204/211 sponge in human rheumatoid arthritis [43]. Zhang et al. revealed that hsa_circ_0005015 acts as a miR-519d-3p sponge to inhibit miR-519d-3p activity, leading to increased MMP-2, XIAP, and STAT3 expression in diabetes retinopathy [44].
To verify the prediction performance of GATCDA, we compared it with other models in case studies of the same three diseases, as shown in Table 2. According to Table 2, 8, 7, and 7 of the top 10 circRNAs predicted by GATCDA were verified to be associated with bladder cancer, diabetes retinopathy, and rheumatoid arthritis, respectively, which are the highest counts among the competing methods. Therefore, GATCDA also outperforms the competing prediction models in terms of the hit rate in case studies.

Discussion
With the rapid development of RNA sequencing technology and bioinformatics analysis, various studies have shown that circRNAs are closely related to the occurrence and development of disease, and that circRNAs act as potential biomarkers for patients with certain cancers. Therefore, discovering associations between circRNAs and diseases is significant for disease diagnosis and treatment. Nevertheless, biological experiments are very costly in terms of time and money. In recent years, predicting associations between circRNAs and diseases using computational methods has become a hot topic. The lack of data on the interactions between circRNAs and diseases limits the predictive power of most computational methods.
In this study, we proposed a new computational model named GATCDA to identify underlying circRNA-disease associations. We performed 5CV experiments to assess the predictive performance of GATCDA. Our method yielded an AUC value of 0.9011 and an AUPR value of 0.896, which are higher than those of DWNN-RLS, KATZHCDA, BiRWR, and DeepWalk. In addition, most of the top 10 circRNAs predicted for each of the three diseases in the case studies (bladder cancer, diabetes retinopathy, and rheumatoid arthritis) have been confirmed in the relevant literature, which suggests that GATCDA can be an effective tool for predicting circRNA-disease associations.
The accurate predictive performance of GATCDA is attributed to the following factors. First, in order to identify more interactions between circRNAs and diseases, circRNA-disease interactions were integrated from four databases, i.e., CircR2Disease, CircAtlas 2.0, Circ2Disease, and CircRNADisease. Therefore, the number of positive samples input to the GAT algorithm is higher. Second, as the circRNA-miRNA-mRNA axis plays an important role in the generation and development of diseases, the circRNA-miRNA interactions and disease-mRNA interactions were adopted to construct features, in which mRNAs are related to 88% of miRNAs. CircRNAs have several distinct modes of action. From the functional perspective of circRNAs as miRNA sponges, an interaction network can be constructed for circRNA network similarity calculations. Other functions of circRNAs are difficult to quantify. Third, more similarities are involved in GATCDA, i.e., disease symptom similarity, disease network similarity, disease information entropy similarity, circRNA network similarity, and circRNA information entropy similarity, which are integrated effectively. Fourth, GAT has clear advantages, including learning representations for nodes on a graph using an attention mechanism. Therefore, it can assign different weights to different nodes in a neighborhood.
GATCDA also has limitations. Compared with other non-coding RNAs, the known interactions between circRNAs and diseases are still scarce. Therefore, the circRNA-disease association matrix is sparse, which has an impact on prediction performance. In the future, we will collect more data on the associations between circRNAs and diseases.

Conclusions
In this study, we proposed a new computational model named GATCDA to identify underlying circRNA-disease associations. Specifically, GAT was used to predict circRNA-disease associations based on multiple similarities of circRNAs and diseases. This work has two highlights. First, as the circRNA-miRNA-mRNA axis plays an important role in the generation and development of diseases, circRNA-miRNA interactions and disease-mRNA interactions are adopted to construct features, in which mRNAs are related to 88% of miRNAs. Second, GAT is used to predict the interactions between circRNAs and diseases. GAT can assign different learning weights to different neighbors, so the correlation between vertex features can be better integrated into the model. In terms of predictive performance, GATCDA achieves an AUC of 0.9011 in 5CV, and in the case studies of three diseases, at least 7 of the top 10 predicted circRNAs for each disease were confirmed in the literature. In summary, GATCDA is a powerful tool for predicting circRNA-disease associations.

Data Availability Statement:
The data presented in this study are available on request from the corresponding authors.