LRGCPND: Predicting Associations between ncRNA and Drug Resistance via Linear Residual Graph Convolution

Accurate inference of the relationship between non-coding RNAs (ncRNAs) and drug resistance is essential for understanding the complicated mechanisms of drug action and for clinical treatment. Traditional biological experiments are time-consuming, laborious, and small in scale. Although several databases provide relevant resources, no computational method for predicting this type of association has yet been developed. In this paper, we leverage the verified association data of ncRNA and drug resistance to construct a bipartite graph and then develop a linear residual graph convolution approach for predicting associations between ncRNA and drug resistance (LRGCPND) without introducing or defining additional data. LRGCPND first aggregates the potential features of neighboring nodes in each graph convolutional layer. Next, it transforms the information between layers through a linear function. Finally, LRGCPND combines the embedding representations of all layers to complete the prediction. Comparison experiments demonstrate that LRGCPND performs more reliably than seven other state-of-the-art approaches, with an average AUC value of 0.8987. Case studies illustrate that LRGCPND is an effective tool for inferring the associations between ncRNA and drug resistance.


Introduction
Non-coding RNAs (ncRNAs) play special roles in the development, differentiation, and aging of cells. Numerous studies have shown that ncRNAs are widely involved in human pathological activities. They act as biomarkers that provide new targets for the treatment of diseases such as cancer [1]. Non-coding RNAs such as microRNAs (miRNAs), circular RNAs (circRNAs), and long ncRNAs (lncRNAs) have attracted great research interest. miRNAs are short regulatory biomolecules that are involved in the post-transcriptional regulation of gene expression [2]. Compared with linear miRNAs, circRNAs [3] are more stable and may function as transporters or scaffolds [4]. They exert essential biological functions by acting as microRNA or protein inhibitors ("sponges"), regulating protein function, or being translated themselves [5]. lncRNAs can play a role in regulating cooperating proteins [6]. piRNAs (Piwi-interacting RNAs) have been studied relatively little compared with those three types. A piRNA can form a piRNA/PIWI complex with PIWI proteins to affect gene expression, and piRNAs mainly function to suppress the activity of transposons [7,8]. There are synergies among RNAs. For example, a lncRNA can act as a molecular sponge of a miRNA to regulate the expression of its target gene [9-12].
According to a statistical cancer report released by the American Cancer Society [13], approximately 4950 new cancer cases and 1600 cancer deaths are estimated to occur every day in the United States. Unfortunately, the development of drug resistance greatly increases the probability of recurrence and significantly reduces the cure rate. Drug resistance has become a major obstacle to clinical treatment.
[...] unknown potential associations. Here we propose an efficient approach based on a linear residual graph convolutional network, LRGCPND, which employs only validated ncRNA-drug resistance interactions. Initially, LRGCPND constructs a bipartite graph from the association network of ncRNA and drug resistance, where the edges represent the hidden interaction factors between the two types of nodes. Unconnected node pairs may have associations that are not obvious to identify. LRGCPND then rapidly aggregates the intrinsic characteristics of neighbor nodes in the previous layer and performs the linear transition. After the specified number of iterations, it fuses the embeddings of the previous convolutional layers through residual learning to explore the interactions between ncRNA and drug resistance. LRGCPND achieves the best performance compared with the other advanced computational methods. Case studies of two anti-cancer drugs demonstrate the practical capability of LRGCPND. The flow chart of LRGCPND is shown in Figure 1.
Figure 1. Flow chart of LRGCPND. $e_n^0$ and $e_r^0$ denote the embeddings of ncRNA $n$ and drug resistance $r$ in layer 0, respectively. LRGCPND contains three steps: aggregation, linear transition, and residual learning. In the feature aggregation step, we use the spectral rule to aggregate the features of neighboring nodes. After that, the linear transformation is adopted to speed up the forward propagation. Finally, we add a residual block to fuse the characteristics of low-layer nodes directly, attaining higher-layer potential features.

Experimental Setup
To evaluate LRGCPND objectively and systematically and to facilitate comparison with other methods, we perform k-fold cross-validation (k-fold CV) on the collected dataset. All verified associations are randomly divided into k parts. In turn, each part serves as the positive samples and is combined with an equal number of unlabeled samples, used as negative samples, to form the testing set. Meanwhile, the same operation is performed on the remaining k − 1 parts to obtain the training set. This process ends after k iterations.
Even though the selected negative samples may contain latent associations, they account for a tiny proportion of the entire unverified sample set, so their influence is negligible.
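The sampling protocol above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name `kfold_splits`, the fixed seed, the toy data, and the choice to draw disjoint negatives for the testing and training sets are all hypothetical.

```python
import random

def kfold_splits(positives, all_pairs, k, seed=0):
    """Yield (train, test) lists of (pair, label) for k-fold CV.

    positives: verified pairs; all_pairs: every candidate pair.
    Each split pairs its positives with an equal number of unlabeled
    pairs drawn at random as negatives.
    """
    rng = random.Random(seed)
    pos = list(positives)
    rng.shuffle(pos)
    pos_set = set(pos)
    unlabeled = [p for p in all_pairs if p not in pos_set]
    folds = [pos[i::k] for i in range(k)]  # k nearly equal parts
    for i in range(k):
        test_pos = folds[i]
        train_pos = [p for j, f in enumerate(folds) if j != i for p in f]
        # Assumption: test and training negatives are disjoint samples.
        neg = rng.sample(unlabeled, len(test_pos) + len(train_pos))
        test_neg, train_neg = neg[:len(test_pos)], neg[len(test_pos):]
        test = [(p, 1) for p in test_pos] + [(p, 0) for p in test_neg]
        train = [(p, 1) for p in train_pos] + [(p, 0) for p in train_neg]
        yield train, test

# Toy example: 6 ncRNAs x 5 drugs with 10 verified pairs.
all_pairs = [(u, v) for u in range(6) for v in range(5)]
positives = all_pairs[::3]
splits = list(kfold_splits(positives, all_pairs, k=5))
```

In this toy setting each testing set holds 2 positives and 2 sampled negatives, and each training set holds 8 of each.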

Evaluation Criteria
To evaluate the models comprehensively, we adopt widely used metrics, including AUC, AUPR, Accuracy (Acc.), Precision (P.), Recall (R.), and F1 score, which are defined as follows:
$$\mathrm{Acc.} = \frac{TP + TN}{TP + TN + FP + FN}, \quad \mathrm{P.} = \frac{TP}{TP + FP}, \quad \mathrm{R.} = \frac{TP}{TP + FN}, \quad F1 = \frac{2 \times \mathrm{P.} \times \mathrm{R.}}{\mathrm{P.} + \mathrm{R.}}$$
where TP and FP are the numbers of correctly and incorrectly classified pairs among the ncRNA-drug resistance pairs predicted as related, and TN and FN are the numbers of correctly and incorrectly classified pairs among those predicted as unrelated. By adjusting the threshold, we can plot the receiver operating characteristic (ROC) curve and the precision-recall (PR) curve and then calculate the areas under these curves to obtain AUC and AUPR, respectively.
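The threshold-based metrics and the AUC can be computed as in the sketch below. It assumes the standard definitions of these metrics; the function names, the 0.5 threshold, and the toy labels and scores are illustrative choices rather than values from the paper. The AUC is computed through its rank interpretation (the fraction of positive-negative pairs ordered correctly), which equals the area under the ROC curve.

```python
def classification_metrics(labels, scores, threshold=0.5):
    """Accuracy, precision, recall, and F1 at a fixed score threshold."""
    pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, pred) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, pred) if y == 0 and p == 1)
    tn = sum(1 for y, p in zip(labels, pred) if y == 0 and p == 0)
    fn = sum(1 for y, p in zip(labels, pred) if y == 1 and p == 0)
    acc = (tp + tn) / len(labels)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"acc": acc, "precision": precision, "recall": recall, "f1": f1}

def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correctly ranked pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy labels and scores (illustrative only).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
m = classification_metrics(labels, scores)
a = auc(labels, scores)
```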

Performance Evaluation for LRGCPND
To evaluate the identification ability of our model, we performed five-fold and ten-fold CV on the dataset specified above. Table 1 lists the specific results in five-fold CV, and Figure 2 displays the ROC curves. In five-fold CV, the average values of AUC, AUPR, and Accuracy reach 0.8987, 0.9094, and 0.8342, respectively. Because ten-fold CV trains on a larger share of the data, the model is trained more thoroughly, and the AUC increases to 0.9052. These experimental results show that LRGCPND can accurately and effectively identify potential ncRNAs related to drug resistance.

Effects of Parameters
LRGCPND has two crucial parameters that influence its prediction capability: the depth of propagation and the dimension of embedding. First, we explored the impact of the layer depth K while keeping the other parameters constant. We performed five-fold CV with K ranging from 1 to 5. Table 2 lists the detailed data, and Figure 3 is the trend chart of the different indicators. Our model achieves the best performance when K is equal to 4.
Second, the embedding dimension S also plays a critical role. Setting S to 8, 16, 32, 64, and 128 in turn, we conducted five-fold CV to measure the impact on the prediction ability of our model. Table 3 shows the detailed statistics, and Figure 4 indicates the trend of the various metrics. As S grows from 8, the performance first improves monotonically, because a larger embedding dimension enhances the expressivity of LRGCPND to a certain extent. The performance reaches its optimum when S is 32; as S increases further, it starts to degrade. In the other experiments, we employ the optimal values obtained above as the default model parameters.

Comparison with Other Approaches
Since inferring ncRNA-drug resistance interactions is a relatively new area, no relevant solutions have been proposed yet. Nonetheless, other association prediction methods in bioinformatics still provide significant references for the performance of our model. To further assess the effectiveness of LRGCPND, we compared it with seven advanced approaches from the fields of lncRNA-disease, circRNA-disease, and microbe-disease association prediction.
For the sake of rigor, we point out that AE-RF [29] and ABHMDA [33] employ other similarity-based features besides the Gaussian interaction profile (GIP) kernel similarity. Considering the scarcity of relevant biological resources and for convenience, we only calculated the GIP similarity for them in the experiments. Furthermore, the adjacency matrix allocated at the beginning of training differs in each fold, so the topology information of the interaction network needs to be re-extracted. We therefore re-calculated the GIP similarity matrices during each cross-validation process for the similarity-based methods AE-RF, KATZHMDA [32], NTSHMDA [35], and ABHMDA. As plotted in Figure 5, LRGCPND leads the other methods with an average AUC value of 0.8987, which is 5.84% higher than the second-best method, DMFMDA [34].
The statistics of the various metrics listed in Table 4 show that, except for a Recall value slightly lower than that of ABHMDA, our model yields the best identification ability. Its AUPR, Accuracy, and F1 values reach 0.9094, 0.8342, and 0.8335, respectively. We also drew a radar chart to compare the capabilities of the models across the various metrics, as shown in Figure 6. All six evaluation metrics range from 0.4 to 1.0. The farther a point lies from the center of the circle, the higher the value. The chart again shows that LRGCPND has advantages over the other methods.
These experimental results demonstrate that our model is reliable and promising in inferring candidate ncRNA-drug resistance pairs.
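For readers unfamiliar with the GIP kernel similarity recomputed for the similarity-based baselines in each fold, the sketch below uses its commonly cited definition, $K(i, j) = \exp(-\gamma \lVert IP_i - IP_j \rVert^2)$ with $\gamma = \gamma' / \big(\tfrac{1}{n}\sum_i \lVert IP_i \rVert^2\big)$. The paper does not restate this formula, so the definition, the default $\gamma' = 1$, the function name, and the toy matrix are assumptions here.

```python
import math

def gip_similarity(profiles, gamma_prime=1.0):
    """Gaussian interaction profile kernel between rows of `profiles`.

    profiles: list of binary interaction profiles (rows of the
    training association matrix). Assumes at least one nonzero entry.
    """
    n = len(profiles)
    mean_sq_norm = sum(sum(x * x for x in row) for row in profiles) / n
    gamma = gamma_prime / mean_sq_norm  # bandwidth normalization
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sq_dist = sum((a - b) ** 2
                          for a, b in zip(profiles[i], profiles[j]))
            sim[i][j] = math.exp(-gamma * sq_dist)
    return sim

# Toy training matrix: 3 ncRNAs (rows) x 4 drugs (columns).
R_train = [[1, 0, 1, 0],
           [1, 0, 0, 0],
           [0, 1, 0, 1]]
ncrna_sim = gip_similarity(R_train)
drug_sim = gip_similarity([list(col) for col in zip(*R_train)])
```

Because the kernel depends only on the training association matrix, it has to be recomputed whenever the held-out fold changes, which is consistent with the re-calculation described above.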

Case Studies
The discovery of unknown associations between ncRNA and drug resistance matters tremendously for practical application. Thus, we selected two drugs, Cisplatin and Paclitaxel, and conducted case studies. Specifically, for a particular drug, we first removed the known associated ncRNAs. Then, the remaining ncRNAs were sorted in descending order of the values predicted by LRGCPND. Lastly, we screened the top 15 ncRNAs and searched for supporting evidence in published literature.
Cisplatin is a common chemotherapeutic drug used to treat numerous cancers, including lung cancer, head and neck cancer, and ovarian cancer. Resistance frequently reduces the efficacy of Cisplatin in chemotherapy [36]. Paclitaxel is a widely used taxane medication. Chemoresistance to Paclitaxel makes its clinical application problematic [37]. Tables 5 and 6 summarize the top 15 candidate ncRNAs for Cisplatin and Paclitaxel, respectively. Of these candidates, 10 for Cisplatin and 7 for Paclitaxel are confirmed by existing evidence, indicating that our method is capable of predicting novel ncRNAs associated with drug resistance. It is noteworthy that the other, unconfirmed associations may still exist and deserve further experiments.
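The screening procedure used in the case studies can be sketched as follows. The function name, the dictionary-based score table, and the placeholder ncRNA and drug names are hypothetical; the scores are invented for illustration and are not predictions of LRGCPND.

```python
def top_candidates(scores, known, drug, k=15):
    """Rank ncRNAs not yet linked to `drug` by predicted score.

    scores: dict mapping (ncrna, drug) -> predicted score.
    known: set of verified (ncrna, drug) pairs to exclude.
    """
    candidates = [(n, s) for (n, d), s in scores.items()
                  if d == drug and (n, d) not in known]
    candidates.sort(key=lambda item: item[1], reverse=True)
    return candidates[:k]

# Hypothetical scores; the names are placeholders, not real predictions.
scores = {("ncRNA_a", "drug_x"): 0.91, ("ncRNA_b", "drug_x"): 0.35,
          ("ncRNA_c", "drug_x"): 0.78, ("ncRNA_d", "drug_x"): 0.64,
          ("ncRNA_a", "drug_y"): 0.12}
known = {("ncRNA_a", "drug_x")}
ranked = top_candidates(scores, known, "drug_x", k=2)
```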

Datasets
NoncoRNA: NoncoRNA [23] contains 5568 ncRNAs and 154 drugs in 134 cancers. This is the first database that provides diverse ncRNAs and associations between ncRNAs and drug resistance in cancers. We use the Feb 2020 version of the NoncoRNA database, which is publicly released at http://www.ncdtcdb.cn:8080/NoncoRNA (accessed on 10 March 2021).
ncDR: ncDR [22] is, to date, one of the most frequently used databases in the field of drug resistance-related non-coding RNA. Here, we adopt the data downloaded from the June 2016 version of the ncDR database. The dataset contains 5864 associations between ncRNAs and drug resistance, including 877 miRNAs and 162 lncRNAs from nearly 900 published articles. It is available at http://www.jianglab.cn/ncDR (accessed on 10 March 2021).
We manually integrated a set of 2693 associations between ncRNAs and drug resistance from the NoncoRNA and ncDR datasets, covering 625 ncRNAs and 121 drugs. We use only the experimental data. In addition, we clean the dataset by removing redundant entries and associations in which an ncRNA is linked to only one drug resistance. The dataset is the union of a positive set $R^+$ and a negative set $R^-$, where $R^+$ contains the 2693 ncRNA-drug resistance associations verified by wet-lab experiments and $R^-$ contains a total of 72,932 ncRNA-drug resistance pairs that have not been verified experimentally. The details of sampling were introduced earlier in Section 2.1. Our dataset can be downloaded from https://github.com/TroyePlus/LRGCPND (accessed on 30 July 2021).

Problem Description
To predict the relationship between ncRNA and drug resistance for a given set of m ncRNAs and n drugs, we use $U = \{u_1, u_2, \ldots, u_m\}$ and $V = \{v_1, v_2, \ldots, v_n\}$ to represent the collections of ncRNAs and drugs, respectively, and $R \in \mathbb{R}^{m \times n}$ is the correlation matrix. If ncRNA $u_i$ is related to drug resistance $v_j$, then the entry $R_{ij} = 1$; otherwise $R_{ij} = 0$. However, $R_{ij} = 0$ does not mean that ncRNA $u_i$ has no association with the drug $v_j$; the relationship may simply not have been found yet. In addition, we use $V_i^+ = \{v_j \mid v_j \in V \text{ and } R_{ij} = 1\}$ to represent the set of drugs known to be linked to ncRNA $u_i$, and $V_i^- = V \setminus V_i^+$ to represent the non-linked set. $D = \{(u_i, v_j) \mid R_{ij} = 1\}$ is defined as the set of all linked ncRNA-drug resistance pairs.
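The notation above maps directly onto simple data structures. In the following sketch, the function name and the toy pairs are hypothetical, and integer indices stand in for the ncRNAs $u_i$ and drugs $v_j$.

```python
def build_association(num_ncrnas, num_drugs, pairs):
    """Return R (m x n 0/1 matrix), linked sets V_i^+, and pair set D."""
    R = [[0] * num_drugs for _ in range(num_ncrnas)]
    for i, j in pairs:
        R[i][j] = 1
    linked = [{j for j in range(num_drugs) if R[i][j] == 1}
              for i in range(num_ncrnas)]
    D = {(i, j) for i in range(num_ncrnas) for j in linked[i]}
    return R, linked, D

# Toy example with 3 ncRNAs and 3 drugs.
pairs = [(0, 0), (0, 2), (1, 1), (2, 0), (2, 1)]
R, linked, D = build_association(3, 3, pairs)
unlinked_0 = set(range(3)) - linked[0]   # V_0^- = V \ V_0^+
```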

Graph Construction
We use a bipartite graph G(U, V, E) to express the associations between different ncRNAs and drug resistance, where U, V are the previously defined ncRNA set and drug set. Every edge e belonging to E represents a verified association between ncRNA u and drug resistance v.

Graph Embedding
Matrix factorization is a common method of graph embedding. It uses only the linear relationship between entities and can be applied to data that contains only associations. However, matrix factorization cannot make full use of the data, and its ability to extract high-order features is weak. In recent years, graph-based models have become popular in the field of semi-supervised classification. Networks built on graphs and combined with deep learning methods can be applied to graph embedding to obtain vector representations of graphs or graph nodes [38]. Graph convolutional networks (GCNs) are often used for association prediction in bioinformatics. Their design is inspired by convolutional neural networks, which are widely used in computer vision. Their advantage is that they can extract the structural features of node neighborhoods and then learn higher-order relationships. Their obvious disadvantages are the over-smoothing problem and time-consuming calculation.
In this work, the task of ncRNA-drug resistance association prediction is similar to the recommendation problem, where the ncRNA corresponds to the user and the drug resistance corresponds to the item. A verified association corresponds to an entry in the user's viewing or shopping history. Therefore, the GCN approach, which is very popular in recommendation tasks, can be applied to our problem. Here, we address the above disadvantages with linear propagation and a residual block based on GCN. We first construct the adjacency matrix A of the bipartite graph G as follows:
$$A = \begin{pmatrix} 0 & R \\ R^{\top} & 0 \end{pmatrix}$$
We then use E to represent the embedding matrix of ncRNA and drug resistance. The initial embedding matrix is filled with values drawn from a normal distribution with a standard deviation of 0.1 (nn.init.normal_). In every training epoch, LRGCPND takes the embedding matrix E as input; E is recomputed in each iteration and updated.
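A minimal sketch of these two initialization steps is given below. It assumes the usual block form of a bipartite adjacency matrix, in which the association matrix R and its transpose occupy the off-diagonal blocks, and it assumes a zero mean for the normal distribution, which the text does not state. The function names, the seed, and the toy matrix are hypothetical, and plain Python lists stand in for the tensors a deep learning framework would use.

```python
import random

def bipartite_adjacency(R):
    """Block adjacency A = [[0, R], [R^T, 0]] of size (m + n) x (m + n)."""
    m, n = len(R), len(R[0])
    A = [[0] * (m + n) for _ in range(m + n)]
    for i in range(m):
        for j in range(n):
            if R[i][j]:
                A[i][m + j] = 1      # ncRNA i -> drug j
                A[m + j][i] = 1      # drug j -> ncRNA i
    return A

def init_embeddings(num_nodes, dim, std=0.1, seed=0):
    """Draw every entry from a zero-mean normal with the given std."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, std) for _ in range(dim)]
            for _ in range(num_nodes)]

R = [[1, 0], [1, 1], [0, 1]]          # 3 ncRNAs x 2 drugs
A = bipartite_adjacency(R)
E0 = init_embeddings(num_nodes=len(A), dim=4)
```

The first m rows of `E0` hold the ncRNA embeddings and the remaining n rows hold the drug resistance embeddings, matching the node ordering of `A`.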

Feature Aggregation
There is no intra-domain edge in the bipartite graph, so message passing and node feature aggregation in the bipartite graph convolution are performed only through inter-domain edges. We use the spectral rule to aggregate the features of the graph:
$$\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} E$$
where $\tilde{A} = A + I$, $I$ is the identity matrix, and $\tilde{D}$ is the degree matrix of $\tilde{A}$. As widely adopted in GCN, the spectral rule considers not only the degree of the $i$th node but also the degree of the $j$th node when calculating the aggregation of the $i$th node.
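The spectral rule can be illustrated with a small dense example. The sketch assumes the symmetric normalization $\tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}$ that is standard in GCN; the function names and the three-node toy graph are hypothetical, and a practical implementation would use sparse matrices.

```python
import math

def normalized_adjacency(A):
    """Return S = D~^{-1/2} (A + I) D~^{-1/2} as a dense list of lists."""
    size = len(A)
    A_tilde = [[A[i][j] + (1 if i == j else 0) for j in range(size)]
               for i in range(size)]
    deg = [sum(row) for row in A_tilde]           # diagonal of D~
    return [[A_tilde[i][j] / math.sqrt(deg[i] * deg[j])
             for j in range(size)] for i in range(size)]

def aggregate(S, E):
    """Aggregate neighbor features: returns the matrix product S @ E."""
    dim = len(E[0])
    return [[sum(S[i][j] * E[j][d] for j in range(len(E)))
             for d in range(dim)] for i in range(len(S))]

# One ncRNA (node 0) linked to two drugs (nodes 1 and 2).
A = [[0, 1, 1],
     [1, 0, 0],
     [1, 0, 0]]
S = normalized_adjacency(A)
E = [[1.0, 0.0], [0.0, 1.0], [0.0, 2.0]]
H = aggregate(S, E)
```

Each entry `S[i][j]` is divided by the square root of both degrees, which is how the rule accounts for the degree of the $i$th and the $j$th node at once.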

Linear Transition
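We remove the nonlinear transformation functions at the end of each layer. Despite the linear propagation of LRGCPND, the "receptive field" of our model is the same as that of a K-layer GCN. The embedding at step $k + 1$ is calculated as:
$$E^{k+1} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} E^{k} W^{k}$$
where $W^k$ represents the linear transformation and $E^k$ is the embedding at step $k$. Because the transformation is linear, the matrix form can be expanded to model the embedding of each ncRNA $n$ and drug resistance $r$:
$$e_n^{k+1} = \left[\frac{1}{d_n} e_n^{k} + \sum_{r \in R_n} \frac{1}{\sqrt{d_n d_r}} e_r^{k}\right] W^{k}, \qquad e_r^{k+1} = \left[\frac{1}{d_r} e_r^{k} + \sum_{n \in R_r} \frac{1}{\sqrt{d_r d_n}} e_n^{k}\right] W^{k}$$
where $d_n$ ($d_r$) is the diagonal degree of ncRNA $n$ (drug resistance $r$), that is, its diagonal entry in $\tilde{D}$, and $R_n$ ($R_r$) represents the neighbors of node $n$ ($r$) in G.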
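One linear propagation step can be sketched per node as follows. The sketch assumes the propagation rule $E^{k+1} = S E^{k} W^{k}$, where $S$ is the symmetrically normalized adjacency matrix with self-loops; the function name, the neighbor-list representation, and the toy weights are hypothetical.

```python
import math

def linear_layer(neighbors, E, W):
    """One linear propagation step, E_next = S @ E @ W, with no activation.

    neighbors[i]: indices adjacent to node i in the bipartite graph.
    E: node embeddings (list of rows); W: square weight matrix.
    """
    dim = len(W)
    deg = [len(nb) + 1 for nb in neighbors]       # +1 for the self-loop
    out = []
    for i, nb in enumerate(neighbors):
        # Self term plus degree-normalized neighbor terms.
        agg = [E[i][d] / deg[i] for d in range(dim)]
        for j in nb:
            w = 1.0 / math.sqrt(deg[i] * deg[j])
            for d in range(dim):
                agg[d] += w * E[j][d]
        # Linear transform only: row vector times W.
        out.append([sum(agg[d] * W[d][c] for d in range(dim))
                    for c in range(len(W[0]))])
    return out

neighbors = [[1, 2], [0], [0]]      # node 0 is an ncRNA, 1 and 2 are drugs
E0 = [[1.0, 0.0], [0.0, 1.0], [0.0, 2.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
E1 = linear_layer(neighbors, E0, identity)
E2 = linear_layer(neighbors, E1, [[2.0, 0.0], [0.0, 0.5]])
```

Stacking the call K times reaches K-hop neighbors, so the receptive field matches that of a K-layer GCN even though no nonlinearity is applied between layers.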

Residual Block in LRGCPND
In a graph convolutional network, stacking layers causes an over-smoothing problem. A GCN acts as a low-pass filter that makes the input signal smoother, which is an inherent advantage of the model. However, after GCN operations are applied many times, the signals tend to become the same, so the diversity of node characteristics is lost, which is a fatal disadvantage for tasks related to node classification. From the perspective of the spectral domain, analysis of the frequency response function of GCN shows that if the smoothing operation is applied repeatedly to a graph signal, the signal eventually becomes equal everywhere, losing the information that discriminates between nodes. Here we adopt the residual block [39] proposed by He et al. to establish an identity mapping. The output of our model can be described as:
$$\hat{o}_{nr}^{k+1} = \hat{o}_{nr}^{k} + e_n^{k+1} \cdot e_r^{k+1} \quad (12)$$
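The residual prediction of Equation (12) accumulates one inner product per layer, as the sketch below shows. The function name and the toy embeddings are hypothetical, and starting the recursion from the layer-0 inner product is an assumption, since the text does not state the base case.

```python
def residual_scores(layer_embeddings, n, r):
    """Accumulate o^{k+1} = o^k + <e_n^{k+1}, e_r^{k+1}> over the layers.

    layer_embeddings: list [E^0, E^1, ..., E^K]; n, r: node indices.
    Assumption: the recursion starts from the layer-0 inner product.
    """
    score = 0.0
    history = []
    for E in layer_embeddings:
        score += sum(a * b for a, b in zip(E[n], E[r]))
        history.append(score)
    return history

# Toy embeddings for two nodes (an ncRNA at index 0, a drug at index 1).
E0 = [[1.0, 0.0], [0.5, 0.5]]
E1 = [[0.2, 0.4], [1.0, 1.0]]
E2 = [[0.0, 1.0], [0.3, 0.1]]
history = residual_scores([E0, E1, E2], n=0, r=1)
```

Under this assumption the final score equals the inner product of the two nodes' embeddings concatenated across layers, so the embeddings of the shallow layers contribute directly to the prediction.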

Model Optimization
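BPR [40] is a ranking algorithm based on matrix factorization. It does not optimize a global score but instead optimizes the ranking of drug-resistance preferences for each ncRNA. It is a pairwise ranking algorithm: for each triple $\langle n, i, j \rangle$, the model seeks to widen the gap between the scores that ncRNA $n$ assigns to drug resistance $i$ and drug resistance $j$. The loss function is
$$\min_{\Theta_1, \Theta_2} L = \sum_{a=1}^{m} \sum_{(i,j) \in D_a} -\ln \sigma\left(\hat{o}_{ai} - \hat{o}_{aj}\right) + \lambda \lVert \Theta_1 \rVert^2$$
where $\sigma$ is the sigmoid function, $\Theta_1 = E^0$, and $\Theta_2 = [W]$; E is updated during backpropagation. $\lambda$ controls the strength of the $L_2$ regularization. $R_a$ denotes the subset of the drug set V that is positively associated with ncRNA $a$. $D_a = \{(i, j) \mid i \in R_a \wedge j \in V - R_a\}$ represents the pairs containing a positive sample $i$ and a negative sample $j$.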
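A pairwise BPR objective of this kind can be sketched in a few lines. The sketch assumes the standard BPR form, the negative log-sigmoid of the score difference plus an $L_2$ penalty; the function name, the default value of `lam`, the score table, and the choice of which parameters are regularized are all hypothetical.

```python
import math

def bpr_loss(score, triples, params, lam=0.01):
    """Pairwise BPR loss with L2 regularization.

    score(n, i): predicted score of ncRNA n for drug resistance i.
    triples: (n, i, j) with i a linked and j an unlinked drug for n.
    params: flat list of parameters to regularize; lam: L2 weight.
    """
    loss = 0.0
    for n, i, j in triples:
        diff = score(n, i) - score(n, j)
        # -ln(sigmoid(diff)) written as softplus(-diff) for stability.
        loss += max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))
    return loss + lam * sum(p * p for p in params)

# Hypothetical score table for one ncRNA (index 0) and three drugs.
table = {(0, 0): 2.0, (0, 1): 0.5, (0, 2): -1.0}
triples = [(0, 0, 1), (0, 0, 2)]
loss = bpr_loss(lambda n, i: table[(n, i)], triples, params=[0.1, -0.2])
```

The loss shrinks as the linked drug resistance is scored further above the unlinked one, which is the pairwise ranking behavior described above.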

Conclusions
Drug resistance poses vital challenges to clinical treatment. Numerous studies have indicated that ncRNA plays a pivotal role in the mechanisms of drug resistance. Accurately identifying ncRNA-drug resistance association pairs is conducive to drug development and promotes clinical treatment. In this work, we propose LRGCPND, a graph convolutional network framework for mining the latent associations between ncRNA and drug resistance through linear transition and residual prediction. To the best of our knowledge, this is the first computational prediction approach in this field. We represent the relationship between ncRNA and drug resistance as a bipartite graph and exploit limited information to learn complex latent factors for edge prediction. LRGCPND first captures the neighborhood representations by aggregation. Then, it performs feature transformation through linear operations. Finally, the embedding vectors of the convolutional layers are concatenated through residual blocks to make the prediction.
Experimental results and case studies corroborate the effectiveness of our model, to which several aspects may contribute. We utilize graph convolution to perform relatively more adequate representation learning on the original association data with inadequate information. Residual blocks enable the model to attain higher-layer potential characteristics, and linear feature propagation keeps the model lightweight and flexible to extend to datasets on a large scale. In conclusion, our model is promising and facilitates further research in predicting novel associated ncRNAs for drug resistance. Our study helps build a systematic map of ncRNA and drug resistance, provides more insights into drug resistance, and aids in identifying effective therapeutic combinations.
As with many computational prediction methods, LRGCPND has its limitations. First, LRGCPND only utilizes ncRNA-drug resistance association data, so the quality and coverage of the association data affect its performance. Second, LRGCPND makes predictions over ncRNAs of several subtypes. Although this provides insights from a broader perspective, differences between subtypes may cause bias. In the future, we will incorporate the attention mechanism and integrate multiple heterogeneous data sources to further improve the performance.