Article

CoupleMDA: Metapath-Induced Structural-Semantic Coupling Network for miRNA-Disease Association Prediction

1 School of Intelligent Systems Engineering, Shenzhen Campus of Sun Yat-sen University, Shenzhen 518107, China
2 School of AI for Science, Peking University, Beijing 100871, China
3 State Key Laboratory of Chemical Oncogenomics, Key Laboratory of Chemical Genomics, Peking University Shenzhen Graduate School, Shenzhen 518055, China
4 Department of Medical Research, China Medical University Hospital, Taichung 40447, Taiwan
* Authors to whom correspondence should be addressed.
Int. J. Mol. Sci. 2025, 26(10), 4948; https://doi.org/10.3390/ijms26104948
Submission received: 22 March 2025 / Revised: 18 May 2025 / Accepted: 19 May 2025 / Published: 21 May 2025
(This article belongs to the Section Molecular Informatics)

Abstract

The prediction of microRNA-disease associations (MDAs) is crucial for understanding disease mechanisms and biomarker discovery. While graph neural networks have emerged as promising tools for MDA prediction, existing methods face critical limitations: (1) data leakage caused by improper use of Gaussian interaction profile (GIP) kernel similarity during feature construction, (2) self-validation loops in calculating miRNA functional similarity using known MDA data, and (3) information bottlenecks in conventional graph neural network (GNN) architectures that flatten heterogeneous relationships and employ over-simplified decoders. To address these challenges, we propose CoupleMDA, a metapath-guided heterogeneous graph learning framework coupling structural and semantic features. The model constructs a biological heterogeneous network using independent data sources to eliminate feature-target space coupling. Our framework implements a two-stage encoding strategy: (1) relational graph convolutional networks (RGCN) for pre-encoding and (2) metapath-guided semantic aggregation for secondary encoding. During decoding, common metapaths between node pairs structurally guide feature pooling, mitigating information bottlenecks. The comprehensive evaluation shows that CoupleMDA achieves a 2–5% performance improvement over the current state-of-the-art baseline methods in the heterogeneous graph link prediction task. Ablation studies confirm the necessity of each proposed component, while case analyses reveal the framework’s capability to recover cancer-related miRNA-disease associations through biologically interpretable metapaths.

1. Introduction

MicroRNAs (miRNAs), a class of approximately 22-nucleotide non-coding RNA molecules, regulate key biological processes including cell proliferation, differentiation, and apoptosis through targeted gene expression modulation [1,2,3]. Their mechanisms of action encompass mRNA degradation induction, translational repression, and epigenetic regulation [4]. Studies have demonstrated that aberrant miRNA expression correlates with various diseases: for instance, downregulation of miR-103/107 associates with amyloid-beta deposition in Alzheimer’s disease [5], while miR-105 shows elevated expression in breast cancer [6]. These findings underscore miRNAs’ potential as disease biomarkers [7].
Conventional wet-lab methods for identifying MDAs face critical bottlenecks, including prolonged experimental cycles and high per-test costs [8]. This has driven the development of bioinformatics approaches based on the functional association hypothesis, which posits that phenotypically similar diseases are associated with functionally related miRNAs. Capitalizing on the strong representation learning capabilities of deep learning in graph-structured data processing, GNN-based frameworks have emerged as cutting-edge solutions for discovering potential MDAs in complex biological networks [9,10]. Among these, heterogeneous graph attention network variants that learn topological features through node embedding and neighborhood aggregation have shown particular promise. These methods generally follow two operational phases: (1) data acquisition and construction of miRNA-disease similarity matrices and (2) graph structure construction with GNN-based neighbor feature aggregation, followed by decoding through feed-forward neural networks.
Current methodologies for constructing miRNA similarity matrices exhibit significant methodological homogeneity and operational risks of data leakage. First, the similarity matrices derived from miRNA sequences and gene interaction networks are inherently sparse, necessitating supplementation through either filling [11,12,13,14,15] or concatenation [16,17] with Gaussian interaction profile (GIP) kernel similarity matrices. Analysis of methodology descriptions and open-source implementations in references [11,12,13,14,15,16,17] reveals that these approaches typically compute GIP-based similarities using the complete known MDA matrix to generate embeddings, yet directly employ these embeddings in cross-validation without masking test samples. This operational flaw allows positive test samples to participate in similarity computation as prior knowledge. Ablation experiments in MHGTMDA [16] demonstrate that removing node embeddings generated from unmasked GIP data causes predictive AUC to plummet from 96% to 74%, confirming substantial performance inflation from data leakage. Notably, most open-source implementations store fixed miRNA similarity matrices for subsequent cross-validation, inherently introducing data leakage. Our experimental results indicate that under such operational protocols, even simple multilayer perceptrons (MLPs) achieve inflated performance metrics, which fail to reflect models’ true predictive capabilities. Although DAEMKL [18] and MGADAE [19] circumvent GIP-induced leakage through autoencoder architectures, the intrinsic parameter sensitivity of GIP methods and their heavy reliance on known network structures hinder models’ ability to capture nonlinear high-order relationships, frequently resulting in overfitting or underfitting.
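The leakage mechanism described above can be illustrated with a minimal sketch. It assumes the commonly used GIP formulation, in which the kernel bandwidth is normalised by the mean squared norm of the interaction profiles; the toy association matrix, the held-out pair, and the function name `gip_kernel` are hypothetical and are not taken from any of the cited implementations.

```python
import numpy as np


def gip_kernel(profiles: np.ndarray) -> np.ndarray:
    """Gaussian interaction profile kernel between the rows of `profiles`.

    Assumes bandwidth gamma = 1 / mean(||profile||^2), a common normalisation.
    """
    sq_norms = (profiles ** 2).sum(axis=1)
    mean_sq = sq_norms.mean()
    gamma = 1.0 / mean_sq if mean_sq > 0 else 1.0
    # Squared Euclidean distance between every pair of rows.
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * profiles @ profiles.T
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))


# Hypothetical toy data: 4 miRNAs (rows) x 3 diseases (columns).
assoc = np.array(
    [[1, 0, 1],
     [1, 0, 0],
     [0, 1, 0],
     [0, 1, 1]],
    dtype=float,
)
test_pairs = [(1, 0)]  # a positive pair held out for testing

# Mask the held-out positives before computing any similarity.
train_assoc = assoc.copy()
for i, j in test_pairs:
    train_assoc[i, j] = 0.0

leaky_sim = gip_kernel(assoc)        # test label contributes to the feature
clean_sim = gip_kernel(train_assoc)  # computed from training labels only

print(leaky_sim[0, 1], clean_sim[0, 1])
```

In this toy case the unmasked kernel rates miRNA 0 and miRNA 1 as more similar than the masked kernel does, because the held-out association itself contributes to their shared profile. Recomputing the kernel inside every cross-validation fold avoids this.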
Second, existing studies [11,12,13,14,18,19,20] predominantly utilize the miRNA functional similarity matrix MISIM developed by Wang et al. (2010) [21], where outdated data represents only a secondary limitation compared with its fundamental methodological flaws. The core issue lies in circular logic: MISIM calculates miRNA functional similarity using pre-existing MDA data, while subsequent research directly incorporates this similarity matrix as input features for MDA prediction models. This approach creates tight coupling between feature and target spaces, where similarity metrics derived from known associations are used to predict homologous relationships, essentially forming a self-validation loop. Such closed-loop designs may lead to over-optimistic model evaluation, significantly compromising real-world generalization capability. Although GRNMF [22] proposed inferring miRNA similarity through gene functional interaction networks (later adopted by MSGCL [23]) to avoid circular reasoning, its reliance on set-wise maximum similarity matching constitutes a manually designed heuristic algorithm capable of capturing only shallow linear relationships.
Consequently, constructing a robust miRNA similarity matrix faces inherent challenges due to deficient data sources: resultant matrices inevitably suffer from either (1) sparsity issues, (2) data leakage with self-validation loops, or (3) limited expressiveness and weak generalization. Similar fundamental constraints equally apply to disease similarity matrix construction.
Various molecular entities, such as proteins, genes, and lncRNAs, provide multi-level biological insights from complementary perspectives. Leveraging additional biological entities as mediators connecting miRNAs and diseases not only addresses data incompleteness in single-molecule analyses but also enhances predictive models with enriched input features [16,24]. HGCNMDA [24] introduces a gene layer to construct a miRNA-gene-disease heterogeneous network, learning feature embeddings through GNN models. MUSCLE [17] establishes three heterogeneous networks using drugs, mRNAs, and lncRNAs as mediators between miRNAs and diseases, with feature embeddings learned via graph attention neural networks. MHGTMDA [16] integrates eight biological entities (including lncRNAs, circRNAs, and proteins) to build a heterogeneous biological graph containing 16 association types, employing heterogeneous graph transformers for feature encoding. Although these methods construct relatively comprehensive graph data structures, they exhibit three critical limitations: (1) The models learn node representations by aggregating multi-hop neighborhood features based solely on edge-type relationships, failing to capture structural information between links through features like common neighbors; (2) The absence of metapath-guided neighbor aggregation prevents intuitive interpretation of captured semantic relationships; (3) Over-simplified decoders relying on basic Hadamard products inadequately process neighbor embeddings. This architectural design results in significant information loss during both encoding and decoding processes.
The prediction of MDAs inherently constitutes a link prediction problem in heterogeneous information networks, where model architecture design must synergistically integrate network heterogeneity and topological characteristics of link prediction, a critical factor enabling high-performance prediction in this study. Metapath-based heterogeneous GNN methods overcome the limitation of flattening heterogeneous relationships in conventional approaches through self-consistent semantic aggregation strategies that capture structural information within metapath subgraphs and fuse multi-semantic features [25,26]. A metapath, defined as an ordered sequence of node and edge types, effectively characterizes multi-level semantic relationships in heterogeneous graphs [27]. In biological entity heterogeneous graphs, numerous biologically meaningful metapaths exist. For instance, the “miRNA → lncRNA → mRNA → disease” metapath reflects the biological semantics where miRNAs regulate disease pathways through ceRNA mechanisms: miRNAs competitively bind with long non-coding RNAs (lncRNAs), thereby releasing their target mRNAs from suppression [28]. A concrete example involves MALAT1 sequestering the miR-200 family in lung cancer, which elevates ZEB1 mRNA expression and promotes epithelial–mesenchymal transition (EMT) [29,30]. Other representative metapaths include the following:
  • miRNA → mRNA → disease: direct miRNA regulation of aberrant gene expression induces pathogenesis [31].
  • miRNA → drug → disease: therapeutic drugs exert effects by modulating miRNA expression [32].
  • miRNA → circRNA → mRNA → disease: the circRNA-miRNA-mRNA axis plays crucial regulatory roles in diseases [33].
These heterogeneous metapaths connecting node pairs of different types establish informational bridges between biological entities, enabling models to capture rich biological semantics while enhancing logical consistency and interpretability.
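As a rough illustration of how such metapaths connect a miRNA to a disease, the number of metapath instances between every miRNA-disease pair can be obtained by multiplying the adjacency matrices along the path. The sketch below uses small hypothetical matrices and a hypothetical helper `metapath_counts`; it is not the metapath mining procedure of CoupleMDA.

```python
import numpy as np

# Hypothetical toy graph with 2 nodes of each type.
# Entry [i, j] = 1 means an edge from source node i to target node j.
mi_lnc = np.array([[1, 0], [1, 1]])  # miRNA  -> lncRNA
lnc_m = np.array([[1, 1], [0, 1]])   # lncRNA -> mRNA
m_dis = np.array([[1, 0], [0, 1]])   # mRNA   -> disease
mi_m = np.array([[0, 1], [0, 0]])    # miRNA  -> mRNA


def metapath_counts(*adjs: np.ndarray) -> np.ndarray:
    """Count metapath instances by chaining adjacency matrices."""
    out = adjs[0]
    for adj in adjs[1:]:
        out = out @ adj
    return out


# miRNA -> lncRNA -> mRNA -> disease
counts_long = metapath_counts(mi_lnc, lnc_m, m_dis)
# miRNA -> mRNA -> disease
counts_short = metapath_counts(mi_m, m_dis)

print(counts_long)   # rows: miRNAs, columns: diseases
print(counts_short)
```

Entry `[i, j]` of each result is the number of distinct instances of that metapath linking miRNA `i` to disease `j`; a zero means the pair is not connected under that semantic pattern.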
Recent advances in link prediction research have enhanced the representation of heterogeneous relationships by integrating structural features (SF) into GNNs. These SFs include common neighbors [34] and resource allocation [35], which enable joint semantic-topological modeling and improve heterogeneous relationship characterization [36]. However, current methods predominantly employ high-order encoders to process high-dimensional semantic and topological node features while relying on simplistic Hadamard products for node representation reconstruction in decoders. This imbalanced design creates an information bottleneck during encoding compression, ultimately leading to unavoidable information dissipation in the decoding phase [37,38].
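The two structural features named above have simple closed forms: common neighbors counts the shared neighbors of a node pair, and resource allocation sums the inverse degrees of those shared neighbors. A minimal sketch on a hypothetical undirected toy graph (the adjacency dictionary and function names are illustrative only):

```python
from typing import Dict, Set


def common_neighbors(adj: Dict[str, Set[str]], u: str, v: str) -> int:
    """Number of nodes adjacent to both u and v."""
    return len(adj[u] & adj[v])


def resource_allocation(adj: Dict[str, Set[str]], u: str, v: str) -> float:
    """Sum of 1/degree over the common neighbors of u and v."""
    return sum(1.0 / len(adj[w]) for w in adj[u] & adj[v])


# Hypothetical undirected graph stored as symmetric neighbor sets.
adj = {
    "a": {"c", "d"},
    "b": {"c", "d"},
    "c": {"a", "b"},
    "d": {"a", "b", "e"},
    "e": {"d"},
}

cn_ab = common_neighbors(adj, "a", "b")
ra_ab = resource_allocation(adj, "a", "b")
print(cn_ab, ra_ab)
```

Resource allocation down-weights high-degree shared neighbors (here `d`), which is why it often discriminates better than a raw count when hub nodes are present.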
To address these limitations, we propose CoupleMDA, a robust heterogeneous graph link prediction framework for biological entities that simultaneously captures structure-semantics coupled features of heterogeneous connections between miRNA and disease nodes. Specifically, we leverage the large-scale heterogeneous graph of eight biological entities constructed by Zou et al. [16] to mine metapaths and develop our model. Our framework autonomously learns the topological structure of heterogeneous graphs and infers target edge connections without over-reliance on heuristic algorithms with human-defined rules or biased prior knowledge. During data preprocessing, we strictly implement dataset partitioning and metapath generation to eliminate data leakage and self-validation loops. The proposed methodology operates through three phases, as shown in Figure 1:
(1) Primary encoding: employ RGCN to pre-encode all nodes in the original heterogeneous graph, establishing comprehensive preliminary topological relationships.
(2) Semantic augmentation: perform secondary encoding by fusing metapath semantic features on target edge nodes, enriching node embeddings with homogeneous neighborhood semantics.
(3) Structural decoding: guide structural decoding using common metapaths (CM) between target node pairs.
The intrinsic semantic information of metapaths combined with structural features from common metapaths enhances the framework’s coupling of graph structure and biological semantics. Guided by this design, we conduct extensive experiments to evaluate CoupleMDA’s superior performance against state-of-the-art baselines.
The contributions of our work are fourfold:
(1) We propose a novel framework for MDA prediction that automatically identifies diverse metapath weights (e.g., miRNA → lncRNA → mRNA → disease) from biological heterogeneous graphs, capturing cross-entity regulatory semantics to enhance biological interpretability.
(2) We design a new decoding mechanism that incorporates common metapath structural features through attention-based dynamic weighting of multi-semantic pathways, enabling joint semantic-topological decoding to mitigate information bottlenecks.
(3) We abandon conventional similarity matrix construction methods (e.g., MISIM) that rely on known MDA data, instead utilizing non-associated data sources to eliminate feature-target space coupling and resolve self-validation loops.
(4) Extensive experiments demonstrate CoupleMDA’s significant outperformance over non-leakage state-of-the-art baselines across evaluation metrics, confirming its effectiveness and generalization capability in diverse scenarios.

2. Related Work

This section reviews two research directions: deep learning-based methods for MDA prediction and heterogeneous graph link prediction.

2.1. Deep Learning-Based MDA Prediction Methods

GNNs have emerged as the dominant framework for MDA prediction due to their capability to model relationships in biological networks. MHGTMDA [16] introduces a heterogeneous biological entity graph encompassing eight biomolecules, which comprehensively models indirect associations between miRNAs and diseases. HGCNMDA [24] constructs a miRNA-gene-disease heterogeneous network that subdivides node features into initial and inductive components to capture both direct and indirect associations. ReHoGCNES-MDA [11] adopts a homogeneous graph convolutional network with regular graph structures, where its random edge sampling strategy reduces training complexity while maintaining accuracy. ADPMDA [15] dynamically balances local and global node information by integrating an adaptive deep propagation GNN.
Multiple models focus on enhancing feature representation through multi-source data integration. DAEMKL [18] integrates miRNA and disease similarity networks via kernel learning, utilizing reconstruction errors from deep autoencoders to predict novel associations. MGADAE [19] combines multi-kernel learning with graph attention mechanisms, aggregating representations from multiple GCN layers to improve feature discriminability. NSAMDA [12] fuses miRNA sequence similarity with comprehensive similarity metrics, employing neighbor-selective graph attention networks to prioritize influential nodes. RFECV [14] implements a two-phase approach that uses deep attentive autoencoders with recursive feature elimination to selectively retain high-impact features in fused miRNA-disease similarity matrices. GCNPCA [20] combines GCN-derived topological features with PCA-based node attributes, achieving classification via random forests.
MSGCL [23] adopts multi-view self-supervised contrastive learning for MDA prediction, refining graph topologies by optimizing contrastive losses between anchor and learner views. MTLMDA [39] employs multi-task learning to jointly train miRNA-disease and gene-disease networks. CFNCM [13] generates association scores through collaborative filtering and classifies pairs using SVMs.
Notably, some models inadvertently introduce data leakage by mishandling GIP-generated miRNA similarity matrices. Others directly adopt MISIM for MDA prediction, entrenching self-validation loops. Even when avoiding these issues, such features capture only shallow linear relationships.

2.2. Heterogeneous Graph Link Prediction

Heterogeneous graph link prediction aims to infer potential or missing relationships between entities in heterogeneous networks. Metapaths have been widely adopted as essential tools for capturing semantic information in heterogeneous networks. Paths2Pair [40] proposes an entity pair screening strategy based on metapaths, which enhances prediction efficiency through content information aggregation. MAGNN [26] develops an intra- and inter-metapath aggregation framework that combines intermediate semantic nodes with multi-path information on node content transformation, improving link prediction performance. EMAA [41] introduces a bidirectional biased random walk algorithm that integrates RNNs and attention mechanisms to explore explicit and implicit metapath semantics. MHGNN [42] captures high-order dependencies in biological heterogeneous graphs through metapath aggregation and drug–target pairs. MV-HRE [43] jointly utilizes metapath views, community views, and subgraph views, aggregating contextual information via relation-aware attention mechanisms.
MLAN [44] designs a meta-learning-based adaptive network that enhances generalization by transferring shared knowledge through historical link-type community subtasks. MTTM [45] employs an adversarial learning framework where generative predictors and discriminative classifiers compete to learn transferable cross-link feature representations. NOH [38] proposes a heterogeneous hypergraph neural network that integrates low-order neighborhood overlaps with high-order group interactions. HeteHG-VAE [46] maps heterogeneous information networks into hypergraphs, learning latent node and hyperedge representations through Bayesian deep generative frameworks, while modeling multi-level relationships via hyperedge attention modules. THGNN [47] develops a topic-aware heterogeneous graph neural network that extracts multi-topic semantics from textual content through alternating aggregation mechanisms.
However, these models generally suffer from imbalanced encoder–decoder designs that inevitably cause information dissipation during decoding. NCNC [48] addresses this limitation by proposing a novel MPNN-then-SF architecture (a message passing neural network followed by structural features) with neural common neighbors, which guides graph representation learning through structural features to achieve high performance. This approach effectively enhances model generalization by mitigating common neighbor decay and distribution shifts caused by graph incompleteness.

3. Preliminary

In this section, we present key concepts and formal definitions related to heterogeneous graph link prediction.
Definition 1 (Heterogeneous Graph).
A heterogeneous graph is formally defined as $G = (\mathcal{V}, \mathcal{E}, \mathcal{T}, \mathcal{R}, \tau: \mathcal{V} \to \mathcal{T}, \phi: \mathcal{E} \to \mathcal{R})$, where $\mathcal{V} = \{v_1, v_2, \ldots, v_n\}$ is the set of nodes, each belonging to a predefined node type set $\mathcal{T}$. $\mathcal{E} = \{e_1, e_2, \ldots, e_n\} \subseteq \mathcal{V} \times \mathcal{V}$ is the set of edges, each belonging to a predefined edge type set $\mathcal{R}$. $\tau(v) \in \mathcal{T}$ is the node type mapping function, and $\phi(e) \in \mathcal{R}$ is the edge type mapping function. This definition characterizes the diversity of nodes and edges in heterogeneous graphs.
Definition 2 (Metapath).
A metapath $P$ is formally defined as a relation pattern on a heterogeneous graph: $P = T_1 \xrightarrow{R_1} T_2 \xrightarrow{R_2} \cdots \xrightarrow{R_k} T_{k+1}$, where $T_1, T_2, \ldots, T_{k+1} \in \mathcal{T}$ is a sequence of node types in the heterogeneous graph, and $R_1, R_2, \ldots, R_k \in \mathcal{R}$ is a sequence of edge types connecting adjacent node types. Metapaths reflect complex semantic associations between different types of nodes in heterogeneous graphs.
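Definitions 1 and 2 can be made concrete with a small data structure. The sketch below stores the type mappings τ and ϕ as dictionaries and checks whether a node sequence is an instance of a given metapath. The class, the helper, and the edge-type names (`binds`, `regulates`, `associated_with`) are hypothetical; the node names reuse the MALAT1/miR-200/ZEB1 example from the Introduction.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class HeteroGraph:
    # tau: node -> node type; phi: directed edge (u, v) -> edge type
    node_type: Dict[str, str] = field(default_factory=dict)
    edge_type: Dict[Tuple[str, str], str] = field(default_factory=dict)

    def add_node(self, v: str, t: str) -> None:
        self.node_type[v] = t

    def add_edge(self, u: str, v: str, r: str) -> None:
        self.edge_type[(u, v)] = r


def matches_metapath(g: HeteroGraph, path: List[str],
                     node_types: List[str], edge_types: List[str]) -> bool:
    """True if `path` is an instance of T1 -R1-> T2 ... -Rk-> T(k+1)."""
    if len(path) != len(node_types) or len(edge_types) != len(node_types) - 1:
        return False
    if any(g.node_type.get(v) != t for v, t in zip(path, node_types)):
        return False
    hops = zip(path, path[1:])
    return all(g.edge_type.get(hop) == r for hop, r in zip(hops, edge_types))


g = HeteroGraph()
for name, t in [("miR-200", "miRNA"), ("MALAT1", "lncRNA"),
                ("ZEB1", "mRNA"), ("lung cancer", "disease")]:
    g.add_node(name, t)
g.add_edge("miR-200", "MALAT1", "binds")               # hypothetical edge types
g.add_edge("MALAT1", "ZEB1", "regulates")
g.add_edge("ZEB1", "lung cancer", "associated_with")

path = ["miR-200", "MALAT1", "ZEB1", "lung cancer"]
is_instance = matches_metapath(
    g, path,
    node_types=["miRNA", "lncRNA", "mRNA", "disease"],
    edge_types=["binds", "regulates", "associated_with"],
)
print(is_instance)
```

Edges are stored in one direction only for brevity, so the reversed node sequence is not recognised as an instance; an undirected variant would insert both `(u, v)` and `(v, u)`.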
Definition 3 (Link Prediction).
The goal of link prediction is to predict edges that do not yet exist in the graph. Given a pair of nodes $(v_i, v_j)$, we aim to predict whether an edge will form between them. We formalize this task as $\hat{y}_{ij} = f(v_i, v_j, \mathcal{T}, \mathcal{R})$, where $\hat{y}_{ij}$ is the predicted value of whether an edge exists between nodes $v_i$ and $v_j$. This definition transforms the link prediction problem into a binary classification task based on node and relation types.

4. Results

This section evaluates the effectiveness of CoupleMDA through extensive experiments. Specifically, we first introduce the experimental setup, including datasets, evaluation metrics, and baseline methods. Subsequently, we provide a detailed analysis of the experimental results and validate the contributions of individual model components through ablation studies.

4.1. Datasets

To comprehensively evaluate the model’s performance on heterogeneous graph link prediction tasks, we first tested it on three representative public heterogeneous graph datasets. Then, we applied CoupleMDA to a previously constructed biological heterogeneous information network for MDA prediction testing.
The three heterogeneous graph datasets mentioned above are DBLP, LastFM, and Amazon. These datasets all use versions organized by HGB [31]. The specific details of each dataset are as follows:
  • DBLP: This dataset is a subset of an academic network, which, after cleaning, contains rich academic elements: it covers 4057 researchers and their 14,328 published academic papers, along with 7723 professional terms and 20 types of academic publishing institutions.
  • LastFM: This is a music social network dataset used to track and analyze users’ music listening behaviors. It continuously collects cross-platform music playback behavior feature data from users. After information cleaning and standardization, the constructed graph dataset covers multiple dimensions: 1892 listeners, 17,632 music creator accounts, and 1088 associated genre tags.
  • Amazon: This dataset comes from user behavior data on an e-commerce platform. It includes product attributes and co-browsing and co-purchasing links between products. Product attributes include price, sales ranking, brand, and category.
We constructed our biological heterogeneous information network based on the dataset from Zou [16]. The training data were derived from the MDA database in HMDDv3.2, from which we selected 901 miRNAs and 877 diseases to build an adjacency matrix. To enhance the dataset, we integrated updated MDA data from HMDDv4.0 [49] and merged diseases with semantic similarities from the HMDD database. We added 75% of the MDAs as the training set to the graph. Other biological entities incorporated include 3348 protein nodes from the STRING database [50], 2633 lncRNA nodes from NONCODEV5 [51], 421 circRNA nodes from CircBase [52], 1319 drug nodes from DrugBank [53], 3024 mRNA nodes from the NCBI database [54], and 100 microbial nodes from the NIH Medical Subject Headings (MeSH) database [55]. Finally, an equivalent number of non-MDAs were randomly selected as negative controls. The data on associations between biological entities were collected from existing public databases and compiled by Zou et al. [16]. The relevant database sources are listed in Table 1.

4.2. Comparison Methods

To comprehensively evaluate the performance of CoupleMDA, we selected representative heterogeneous graph neural network models and MDA prediction models as baselines. These baseline models can be categorized into three classes:

4.2.1. General Heterogeneous Graph Neural Networks

  • RGCN [69]: This model extends the traditional graph convolutional network (GCN) to heterogeneous graph scenarios by performing convolutional operations separately on different types of edges, followed by weighted aggregation. Its core innovation lies in introducing specific weight matrices for each relation type, effectively handling multi-type edge relationships in heterogeneous graphs.
  • HGT [70]: This model introduces a heterogeneous graph transformer architecture, processing different types of nodes and edges through a heterogeneous attention mechanism. Its key feature is leveraging multi-type edge information as weights for message passing and capturing long-range dependencies via a metapath-guided transformer structure.
  • GATNE [71]: This method learns three distinct embeddings for each node: general embeddings, edge-type-specific embeddings, and attribute-enhanced embeddings. By fusing these embeddings, the model captures both global node characteristics and relation-specific features.
  • HetGNN [72]: This model employs type-aware random walk strategies to sample heterogeneous neighbors and processes heterogeneous information through a two-layer aggregation mechanism. The first layer aggregates nodes of the same type, while the second layer fuses features across different types, followed by end-to-end optimization of node representations.
  • Simple-HGN [73]: This model enhances the performance of graph attention networks (GAT) by introducing three key components: learnable type embeddings, residual connections, and L2 normalization of output embeddings. These simple yet effective modifications significantly improve performance on heterogeneous graphs.
  • HAN [74]: This model designs hierarchical attention mechanisms, including node-level attention and semantic-level attention. Node-level attention aggregates information within individual metapaths, whereas semantic-level attention integrates semantic information from multiple metapaths.
  • MAGNN [26]: The primary innovation of this model lies in preserving and leveraging intermediate node information in metapaths. By designing a metapath instance encoder, the model captures complete path semantics rather than relying solely on path endpoints.
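To make the relation-specific weight matrices of RGCN in the list above concrete, a single layer can be sketched in NumPy. This is a simplified illustration that assumes mean normalisation over each relation's neighbors, a self-loop weight, and a ReLU activation; it omits the basis decomposition and other details of practical implementations, and all names are hypothetical.

```python
from typing import Dict

import numpy as np


def rgcn_layer(h: np.ndarray,
               adj_by_rel: Dict[str, np.ndarray],
               w_rel: Dict[str, np.ndarray],
               w_self: np.ndarray) -> np.ndarray:
    """One relational graph convolution step.

    h          : (n_nodes, d_in) node features
    adj_by_rel : relation -> (n_nodes, n_nodes) matrix, [i, j] = 1 if j is a
                 neighbor of i under that relation
    w_rel      : relation -> (d_in, d_out) relation-specific weights
    w_self     : (d_in, d_out) self-connection weights
    """
    out = h @ w_self
    for rel, adj in adj_by_rel.items():
        deg = adj.sum(axis=1, keepdims=True)
        norm_adj = adj / np.maximum(deg, 1.0)  # mean over neighbors per relation
        out = out + norm_adj @ h @ w_rel[rel]
    return np.maximum(out, 0.0)  # ReLU


# Hypothetical 2-node example with one relation and identity weights.
h = np.eye(2)
adj_by_rel = {"targets": np.array([[0.0, 1.0], [0.0, 0.0]])}
w_rel = {"targets": np.eye(2)}
w_self = np.eye(2)

h_next = rgcn_layer(h, adj_by_rel, w_rel, w_self)
print(h_next)
```

With identity weights, node 0 simply adds the features of its single neighbor to its own, while node 1, which has no neighbors under this relation, keeps its features unchanged.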

4.2.2. MDA Prediction Models

  • HGCNMDA [24]: This method introduces a gene layer to construct a miRNA-gene-disease heterogeneous network. It employs a multi-relational GCN to encode representations for given miRNAs, diseases, or genes, integrating neighbor representations according to distinct relation types.
  • MHGTMDA [16]: This approach constructs a heterogeneous biological entity graph containing eight biomolecule types to comprehensively model indirect associations between miRNAs and diseases. The model processes diverse node and edge types through a heterogeneous attention mechanism, utilizing multi-type edge information as message-passing weights and capturing long-range dependencies via a metapath-guided transformer architecture.

4.2.3. Link Prediction Models

  • NCNC [48]: This model proposes an MPNN-then-SF framework and introduces the concept of neural common neighbors. By combining message passing with structural features, it significantly improves link prediction performance. Additionally, the model incorporates mechanisms to address graph incompleteness, thereby enhancing prediction generalizability.

4.3. Experimental Setup

For data partitioning, we randomly divided all existing edges (positive samples) in the dataset into training, validation, and test sets with a ratio of 75:10:15. Negative samples were sampled from node pairs that did not have any existing edges in the graph at a 1:1 ratio. To prevent information leakage, the training graph only contained positive edges from the training set, while positive edges in the validation and test sets were removed from the graph during training. We use the results on the test set as the final criterion for evaluating the performance of the model.
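The partitioning protocol above can be sketched as follows. The helper names, the toy edge list, and the use of Python's `random` module are assumptions made for illustration and do not reproduce the authors' code.

```python
import random
from typing import List, Sequence, Set, Tuple

Edge = Tuple[int, int]


def split_edges(edges: Sequence[Edge], seed: int,
                ratios: Tuple[float, float, float] = (0.75, 0.10, 0.15)):
    """Shuffle positive edges and split them into train/val/test."""
    rng = random.Random(seed)
    shuffled = list(edges)
    rng.shuffle(shuffled)
    n_train = round(ratios[0] * len(shuffled))
    n_val = round(ratios[1] * len(shuffled))
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder, i.e. ratios[2]
    return train, val, test


def sample_negatives(n_src: int, n_dst: int, positives: Set[Edge],
                     k: int, seed: int) -> List[Edge]:
    """Draw k distinct node pairs that have no edge anywhere in the graph."""
    if k > n_src * n_dst - len(positives):
        raise ValueError("not enough non-edges to sample from")
    rng = random.Random(seed)
    negatives: Set[Edge] = set()
    while len(negatives) < k:
        pair = (rng.randrange(n_src), rng.randrange(n_dst))
        if pair not in positives:
            negatives.add(pair)
    return sorted(negatives)


# Hypothetical positives: 100 distinct (source, target) pairs.
positives = [(i, (3 * i) % 25) for i in range(100)]
train_pos, val_pos, test_pos = split_edges(positives, seed=0)

# The training graph holds training positives only.
train_graph = set(train_pos)

# 1:1 negatives for the test set, excluding every known positive.
test_neg = sample_negatives(100, 25, set(positives), k=len(test_pos), seed=0)
print(len(train_pos), len(val_pos), len(test_pos), len(test_neg))
```

Excluding all known positives (not just training ones) when sampling negatives keeps held-out positives from being mislabeled as negatives, while building `train_graph` from `train_pos` alone keeps them out of message passing.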
Model performance was evaluated using the area under the receiver operating characteristic curve (ROC-AUC) and average precision (AP) scores. Additional core classification metrics included precision, recall, and F1 score:
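$$\text{Precision}\ (P) = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}},$$
$$\text{Recall}\ (R) = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}},$$
$$F_1\ \text{score} = \frac{2 \cdot P \cdot R}{P + R},$$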
where TP denotes true positives, FP represents false positives, and FN indicates false negatives.
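As a worked example of these three formulas, the sketch below evaluates them on hypothetical confusion counts (80 true positives, 20 false positives, 10 false negatives); the counts are invented for illustration and are not results from the paper.

```python
from typing import Tuple


def classification_metrics(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall and F1 from confusion counts (0.0 when undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts, for illustration only.
precision, recall, f1 = classification_metrics(tp=80, fp=20, fn=10)
print(precision, recall, f1)
```

Here precision is 80/100, recall is 80/90, and F1 reduces to 2·TP / (2·TP + FP + FN) = 160/190.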
The AUC metric measures the probability that a classifier ranks a randomly chosen positive sample higher than a randomly chosen negative sample:
$$\mathrm{AUC} = \frac{\sum_{i=1}^{M} \sum_{j=1}^{N} \left[ I(s_i > t_j) + 0.5 \cdot I(s_i = t_j) \right]}{M \cdot N},$$
where $M$ and $N$ denote the numbers of positive and negative samples, $s_i$ represents the prediction score of the $i$-th positive sample, $t_j$ indicates the prediction score of the $j$-th negative sample, and $I(\cdot)$ is an indicator function that returns 1 when the condition is satisfied. The AP metric approximates the area under the precision–recall curve using the trapezoidal rule:
$$\mathrm{AP} = \sum_{k=1}^{n-1} (R_{k+1} - R_k) \cdot \frac{P_k + P_{k+1}}{2},$$
where $R_k$ and $P_k$ represent the recall and precision at the $k$-th threshold.
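Both definitions translate directly into code. The following sketch implements the pairwise AUC (with ties counted as 0.5) and the trapezoidal AP exactly as written above; the scores and the precision–recall points are hypothetical, and the quadratic pairwise loop is meant for clarity rather than efficiency.

```python
from typing import Sequence


def pairwise_auc(pos_scores: Sequence[float], neg_scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    total = 0.0
    for s in pos_scores:
        for t in neg_scores:
            if s > t:
                total += 1.0
            elif s == t:
                total += 0.5  # ties count as half
    return total / (len(pos_scores) * len(neg_scores))


def trapezoidal_ap(recalls: Sequence[float], precisions: Sequence[float]) -> float:
    """Area under the precision-recall curve via the trapezoidal rule.

    `recalls` must be sorted in increasing order and aligned with `precisions`.
    """
    if len(recalls) != len(precisions):
        raise ValueError("recalls and precisions must have the same length")
    return sum(
        (recalls[k + 1] - recalls[k]) * (precisions[k] + precisions[k + 1]) / 2.0
        for k in range(len(recalls) - 1)
    )


# Hypothetical prediction scores.
auc = pairwise_auc(pos_scores=[0.9, 0.8, 0.4], neg_scores=[0.7, 0.4, 0.1])

# Hypothetical precision-recall points at three thresholds.
ap = trapezoidal_ap(recalls=[0.0, 0.5, 1.0], precisions=[1.0, 1.0, 0.5])
print(auc, ap)
```

In the AUC example, 7 of the 9 pairs are ranked correctly and one pair is tied, giving 7.5/9; in the AP example the two trapezoids contribute 0.5 and 0.375.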
For experimental configuration, all models used the same five random seeds for dataset partitioning and training, giving five independent repetitions per model. Each seed produces a different split into training and validation sets, while the fixed test set remains unseen by the model until final evaluation. We employed the Adam optimizer with weight decay for parameter optimization and tuned hyperparameters based on validation set performance. All baseline implementations were adapted from their official codebases, with modifications made to their data loading interfaces and downstream decoders. Experiments were conducted on an NVIDIA RTX 4090D GPU with 24 GB of memory.

4.4. Performance Analysis

This section first presents the experimental results of different models on heterogeneous graph link prediction tasks using the DBLP, LastFM, and Amazon datasets to demonstrate the superiority of the CoupleMDA model on general datasets. It then shows the experimental results of different models on MDA prediction tasks using the previously constructed biological heterogeneous information network to verify the robustness of CoupleMDA on specialized datasets.
The experimental results for link prediction on general heterogeneous graph datasets are shown in Table 2. Experiments demonstrate that CoupleMDA consistently outperforms other baselines across different datasets. Compared with the best baseline (NCNC), CoupleMDA achieves a performance improvement of approximately 2–5%, indicating that common metapaths, used as structural information to guide GNN pooling, improve information capture. Across method types, metapath-based approaches generally outperform relation-based methods, suggesting that metapaths better capture semantic information in heterogeneous graphs. Further analysis reveals that CoupleMDA performs exceptionally well on real-world datasets with well-structured graph data. In graphs with fewer isolated edges, the neighborhood connections of the target edge are more complete, so the model can better leverage the coupled advantages of structure and semantics to deliver more accurate predictions.
The experimental results for MDA link prediction on the test set are shown in Table 3 and Figure 2. The performance of different models on the training set and validation set, as well as detailed results on the test set, are provided in Appendix A.1. Specifically, Table A1 records the average metrics of five experiments for the different models on the training, validation, and test sets of the Zou dataset. Tables A2–A10 each record the TP, TN, FP, and FN values of one model on the test set in each experiment. Notably, HGCNMDA and MHGTMDA were evaluated using their models on our reconstructed dataset after removing data that could cause information leakage.
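As a worked check of how the per-run TP/TN/FP/FN counts relate to the reported percentages, the sketch below recomputes the metrics for the first CoupleMDA run listed in Table A10 (TP = 1398, TN = 1356, FP = 167, FN = 125). The function name `confusion_metrics` is illustrative only, and the formulas are the standard definitions rather than code taken from the paper.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Standard binary-classification metrics, returned in percent."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    values = {"precision": precision, "recall": recall, "f1": f1, "acc": accuracy}
    return {name: round(100 * value, 2) for name, value in values.items()}

# Counts from the first CoupleMDA run in Table A10.
run1 = confusion_metrics(tp=1398, tn=1356, fp=167, fn=125)
print(run1)
```

The recomputed values agree with the tabulated 89.33, 91.79, 90.54, and 90.41. The counts also show that each test run contains 1523 positive and 1523 negative pairs.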
The results demonstrate that CoupleMDA outperforms the other baselines on Zou's dataset, achieving approximately 0.5% improvement over the strongest baseline (NCNC). CoupleMDA exhibits superior performance on real-world datasets with well-structured graphs because it can better leverage the coupling of structural and semantic features when target edges have comprehensive neighborhood connections. However, Zou's dataset contains numerous isolated nodes, and some nodes have far more connections of particular types than others, which obscures structural features and limits CoupleMDA's potential. Despite this limitation, CoupleMDA still effectively captures semantic information and basic connectivity patterns, outperforming other state-of-the-art models. Crucially, methods that incorporate structural information during decoding (e.g., NCNC and CoupleMDA) surpass approaches that focus solely on encoding node embeddings. This observation indicates that considering structural features during decoding is essential, as processing node features in isolation may neglect edge-specific structural correlations.

4.5. Ablation Experiment

To investigate the contributions of different components, we conducted ablation experiments on various CoupleMDA variants. Table 4 and Figure 3 present the ablation results of different variants on the DBLP, LastFM, and Amazon datasets. Table 5 shows the ablation results of different variants on Zou's dataset. The variants' performance across the training, validation, and test sets of the Zou dataset (averaged over five trials) is detailed in Appendix A.2 (Table A11), with per-trial test-set TP/TN/FP/FN counts in Tables A12–A16. Specifically, we designed the following variants:
  • CoupleMDA-link: removes the co-metapath guided decoding mechanism that incorporates structural features.
  • CoupleMDA-attn: replaces the inter-metapath self-attention with batch-based additive attention.
  • CoupleMDA+iattn: substitutes the simple mean aggregation with intra-metapath attention.
  • CoupleMDA+self_super: incorporates self-supervised learning for node-type and metapath-type classification.
  • CoupleMDA-HMPNN: removes the HMPNN pre-encoding mechanism.
The results reveal performance degradation across all variants when specific components are removed or self-supervised learning is added. CoupleMDA’s superiority primarily stems from its co-metapath mechanism that couples semantic and structural information. The inter-metapath self-attention aggregation demonstrates more robust performance than batch-based additive attention, as it remains unaffected by batch configurations or sample ordering. Other components contribute marginally to the performance, while the addition of self-supervised learning fails to improve results. Although HMPNN provides limited enhancement to final prediction accuracy, it significantly accelerates model convergence during training.

4.6. Visualization Analysis

To investigate the internal mechanisms of our model, we visualize its intermediate results.

4.6.1. Inter-Class Common Metapath Aggregation Weights

We extract the self-attention weights for inter-class common metapath aggregation, as shown in the heatmap of Figure 4. Each sample receives independent attention weights, where higher values indicate greater contribution of specific metapaths to model decisions. We group the weights for positive and negative samples separately.
For positive samples, the weights exhibit a skewed distribution, suggesting that model decisions primarily rely on specific common metapaths. Notably, the miRNA-mRNA-disease metapath provides the highest contribution, followed by miRNA-drug-disease and miRNA-circRNA-disease metapaths, while the miRNA-disease-miRNA-disease metapath shows minimal contribution. This pattern aligns with biological reality since the miRNA-mRNA-disease metapath contains richer connectivity in heterogeneous biological graphs. Our analysis indicates that effective common metapaths expand the receptive field of common neighbors, enabling the model to perceive broader topological structures without losing critical connection details.
In contrast, negative samples in Figure 4 display uniformly distributed weights, suggesting either insufficient structural information capture by the model or the absence of common metapaths for these unconnected node pairs, which consequently receive lower link prediction probabilities.

4.6.2. Encoded Vector Visualization

We qualitatively evaluate the node representations learned by CoupleMDA against baseline models GAT and HGCNMDA through t-SNE visualization of their final-layer encoded vectors on Zou’s dataset. As shown in Figure 5, CoupleMDA-generated embeddings demonstrate significantly clearer cluster structures, exhibiting compact intra-class clustering and distinct inter-class boundaries in the 2D projection space. Different categories are separated by substantial margins, indicating superior discriminative power compared with the more overlapping distributions produced by baseline models.

4.7. Biomedical Case Study

To systematically evaluate the biomedical utility of CoupleMDA, we first performed differential expression analysis of miRNAs between prostate cancer (PCa) tissues and benign prostate tissues and then investigated the correlation between model predictions and differentially expressed miRNAs through statistical methods. Prostate cancer remains one of the most frequently diagnosed malignancies in males, where miRNAs serve as crucial biomarkers with specific expression profiles and therapeutic potential [75].
The differential expression data were obtained from the GSE112264 dataset [76], containing 809 PCa tissue samples and 241 benign prostate tissue samples. Through volcano plot visualization, we identified 1755 differentially expressed miRNAs (Figure 6a), with the top 50 most significant expression shifts displayed in a heatmap (37 upregulated and 13 downregulated, Figure 6c). We selected PCa-associated miRNAs using established thresholds: |log2 FC| > 1 with adjusted p-value < 0.05. miRNAs meeting these criteria were classified as significantly differentially expressed, while others were labeled non-differentially expressed. This selection protocol integrated twofold expression changes and rigorous statistical validation (FDR < 5%) to ensure reliability.
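The selection rule above amounts to a two-condition filter. The following sketch applies it to a few made-up records; the miRNA names and numbers are hypothetical placeholders, not values from GSE112264, and `is_differentially_expressed` is an illustrative helper name.

```python
def is_differentially_expressed(log2_fc, adj_p, fc_cutoff=1.0, alpha=0.05):
    """Apply both criteria from the text: |log2 FC| > 1 and adjusted p < 0.05."""
    return abs(log2_fc) > fc_cutoff and adj_p < alpha

# Hypothetical records (name, log2 fold change, adjusted p-value); not real data.
records = [
    ("miR-A", 2.3, 0.001),    # large, significant shift
    ("miR-B", -1.4, 0.020),   # downregulated, significant
    ("miR-C", 0.6, 0.0004),   # significant but below the fold-change cutoff
    ("miR-D", 1.8, 0.300),    # large shift that fails the FDR threshold
]
labels = {name: is_differentially_expressed(fc, p) for name, fc, p in records}
```

Since the fold change is on a log2 scale, the cutoff of 1 corresponds to at least a twofold change in either direction.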
CoupleMDA achieved an AUC of 91% and an AP of 85% when evaluating associations between all 1755 miRNAs and PCa. We validated the top 42 model-predicted PCa-related miRNAs against the dbDEMC database, as illustrated in Figure 6b. Among these, twenty-seven miRNAs received experimental support from existing literature (dbDEMC-EXP00020, EXP00032, EXP00042, EXP00044, EXP00045, EXP00046, EXP00403, EXP00469, EXP00639, EXP00640), while nine miRNAs overlapped with significantly differentially expressed RNAs identified in the GSE112264 analysis. This case study demonstrates the practical value and effectiveness of the proposed CoupleMDA method.

5. Methodology

This section provides a detailed description of the proposed CoupleMDA framework. The model represents a novel structure-semantics-coupled metapath aggregation approach for MDA link prediction in biological heterogeneous graphs. CoupleMDA consists of three core components: (1) metapath discovery, (2) an RGCN-based node pre-encoding module [69], and (3) a secondary encoding module based on metapath aggregation and a common metapath decoding module. Figure 1 illustrates these primary components and their workflow.

5.1. Overview of Model Framework

In link prediction, structural features (SF), represented by common neighbors and path information, are the core source of evidence. The MPNN-then-SF architecture proposed by Wang et al. [48] adopts a two-stage design: first executing a message-passing neural network (MPNN) on the original graph, then leveraging structural features to guide the pooling of MPNN-derived features. Inspired by this framework, our model generalizes the MPNN-then-SF approach from homogeneous to heterogeneous graphs, as illustrated in Figure 1.
Figure 1a displays the connection types in the constructed biological heterogeneous information network. Each category of biological entity contains multiple specific instance nodes, where each node connects to others according to the relationship types shown in Figure 1a. For example, the miRNA-type node “hsa-mir-122-5p” is linked to the mRNA-type node “nm_003045.5” in the database.
The workflow of CoupleMDA consists of four stages:
  • Stage 1 (Figure 1b): discover all metapath types of lengths 2 and 3 from the biological heterogeneous information network. For each metapath type, recursively traverse all nodes to identify all node-level instances and store them.
  • Stage 2 (Figure 1c): learn node representations by aggregating multi-hop neighborhood features based on heterogeneous edge types in the original graph, enabling node embeddings to capture preliminary structural relationships.
  • Stage 3 (Figure 1d): for each target node pair, generate two node-specific embeddings by aggregating their individual metapaths and one common metapath embedding by aggregating common metapaths between the pair.
  • Stage 4 (Figure 1e): input the Hadamard product of the two node embeddings and the weighted common metapath embedding into a decoder to produce the link prediction.
Subsequent sections elaborate on the core components of the model.
This framework offers two key advantages: First, the common metapath serves as a dominant structural feature to guide link prediction. Second, the pre-encoded node embeddings from the RGCN encapsulate preliminary connectivity and interaction patterns in the graph, thereby enhancing downstream structural feature embeddings. The strong coupling between RGCN and SF ensures high expressiveness for link prediction tasks. Furthermore, the RGCN-based pre-encoding addresses incomplete metapath issues caused by missing links [41].

5.2. MetaPath Discovery

The constructed biological heterogeneous graph comprises eight node types and sixteen edge types, as visually detailed in Figure 1a and Table 6. During data preprocessing, we identify all applicable metapaths for model learning. Prior to metapath discovery, the dataset is strictly partitioned: only miRNA-disease edges from the training set are retained in the original graph, while validation and test set edges are removed.
We define two metapath categories:
  • Node-specific metapaths (P(v)): paths starting from target node v and ending with nodes of the same type (example: miRNA-circRNA-miRNA (MCM)).
  • Common metapaths (P(v, u)): paths connecting heterogeneous node pairs (v and u), serving as bridges between different entity types (example: miRNA-lncRNA-mRNA-disease (MLRD)).
Node-specific metapaths (e.g., MCM) capture semantic interactions between homogeneous nodes, whereas common metapaths (e.g., MLRD) establish interpretable bridges between heterogeneous nodes and guide structural pooling.
We discovered 12 node-specific and 13 common metapaths (Figure 1b, Table 6). To let the self-attention mechanism learn metapath contributions autonomously, we exhaustively generate all permuted metapath combinations within the specified lengths. According to γ-decay theory [26,77], higher-order structural features can be effectively approximated through low-order neighbors (small h-hops), as approximation errors decrease exponentially with h. Thus, metapaths of length up to 3 suffice to capture high-order structural information.
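Type-level metapath discovery can be illustrated with a small depth-first enumeration over a schema graph. The sketch below is only an approximation of the idea: the four-type schema is a hypothetical subset (the actual network has eight node types and sixteen edge types), length is counted in edges, and `enumerate_metapaths` is an illustrative name rather than a function from the released code.

```python
def enumerate_metapaths(schema, start, end, max_len=3):
    """Return all node-type sequences from start to end using 2..max_len edges."""
    results = []

    def extend(path):
        n_edges = len(path) - 1
        if n_edges >= 2 and path[-1] == end:
            results.append("-".join(path))
        if n_edges == max_len:
            return
        for nxt in sorted(schema[path[-1]]):
            extend(path + [nxt])

    extend([start])
    return results

# Hypothetical schema: which node types are linked by at least one edge type.
schema = {
    "miRNA": {"mRNA", "disease", "circRNA"},
    "mRNA": {"miRNA", "disease"},
    "circRNA": {"miRNA", "disease"},
    "disease": {"miRNA", "mRNA", "circRNA"},
}

node_specific = enumerate_metapaths(schema, "miRNA", "miRNA")  # e.g. miRNA-circRNA-miRNA
common = enumerate_metapaths(schema, "miRNA", "disease")       # e.g. miRNA-mRNA-disease
```

Node-level instances would then be collected by walking the data graph along each discovered type sequence, using only training-set miRNA-disease edges as described above.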

5.3. RGCN Node Pre-Encoding Module

Prior to feeding node vectors into RGCN, we apply type-specific linear transformations to project feature vectors of different node types into a common latent space. For each node $v^{T}$ of type $T$, the transformation is defined as
$$h_v^{(0)} = h_v^{T} = W_T \cdot x_v^{T},$$
where $h_v^{(k)}$ denotes the feature representation of node $v$ at layer $k$, $x_v^{T} \in \mathbb{R}^{d_T}$ represents the original feature vector, and $W_T \in \mathbb{R}^{d \times d_T}$ is the learnable weight matrix for type $T$ nodes.
After a comprehensive evaluation, we select a single-layer RGCN as the pre-encoding module. This shallow architecture focuses on capturing direct first-order neighborhood features while preserving the fidelity of primitive topological information. As shown in Figure 1c, RGCN handles heterogeneous edges through relation-aware processing. The update rule for each node $v$ in RGCN is formulated as
$$h_v^{(k+1)} = \sigma\left( W_0 h_v^{(k)} + \sum_{r \in R} \sum_{u \in N_r(v)} \frac{1}{c_r} W_r h_u^{(k)} \right),$$
where $R$ denotes the set of all relation types, $N_r(v)$ represents the neighborhood of node $v$ under relation $r$, $W_0$ and $W_r$ are learnable weight matrices, $c_r$ is a normalization constant, and $\sigma$ is the activation function.
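To make the update rule concrete, the sketch below runs one RGCN-style layer on a three-node toy graph in plain Python. Several choices are assumptions made for illustration: features are taken to be already projected into the shared space, the normalization is set to the number of neighbors under each relation, ReLU stands in for the activation, and identity weight matrices are used so the result can be checked by hand.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def rgcn_layer(h, neighbors, W_self, W_rel):
    """h: {node: vector}; neighbors: {relation: {node: [neighbor nodes]}}."""
    out = {}
    for v, h_v in h.items():
        total = matvec(W_self, h_v)                    # self-connection term
        for r, adjacency in neighbors.items():
            nbrs = adjacency.get(v, [])
            if not nbrs:
                continue
            c = len(nbrs)                              # assumed normalization
            for u in nbrs:
                message = matvec(W_rel[r], h[u])
                total = [t + m / c for t, m in zip(total, message)]
        out[v] = [max(0.0, t) for t in total]          # ReLU (assumed activation)
    return out

identity = [[1.0, 0.0], [0.0, 1.0]]                    # toy weights, not learned
h0 = {"m1": [1.0, 0.0], "g1": [0.0, 2.0], "d1": [1.0, 1.0]}
neighbors = {
    "miRNA-mRNA": {"m1": ["g1"], "g1": ["m1"]},
    "miRNA-disease": {"m1": ["d1"], "d1": ["m1"]},
}
h1 = rgcn_layer(h0, neighbors, identity, {r: identity for r in neighbors})
```

With identity weights, node `m1` simply adds its own vector to the normalized messages from its mRNA and disease neighbors, which shows how a single layer mixes first-order information across relation types.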

5.4. MetaPath Full-Node Encoding

To comprehensively exploit semantic information from metapaths, we design a three-layer nested full-node encoder architecture that progressively processes information through the following stages: individual metapath encoding, intra-class metapath aggregation, and inter-class metapath aggregation.
For encoding individual metapaths, we employ linear transformation layers. The encoded representations of node-specific metapath $P(v)$ and common metapath $P(v, u)$ are formulated as
$$h_{P(v)} = W_P \cdot \mathrm{MEAN}\left(\{ h_t \mid t \in P(v) \}\right), \qquad h_{P(v,u)} = W_P \cdot \mathrm{MEAN}\left(\{ h_t \mid t \in P(v, u) \}\right),$$
where metapaths of the same category share the identical weight matrix $W_P \in \mathbb{R}^{d \times d_T}$.
After completing individual metapath encoding, we obtain encoded metapaths $M_P = \{h_{P_1}, h_{P_2}, \ldots, h_{P_m}\}$ for category $P$. We then apply a graph attention network [78] to compute weighted sums of all metapath encodings related to target nodes. The aggregated representation for category $P$ is formulated as
$$h_P = \sigma\left( \sum_{h_{P_i} \in M_P} \alpha_i^{P} h_{P_i} \right), \qquad \alpha_i^{P} = \frac{\exp\left(\mathrm{LeakyReLU}(\Theta^{\top} h_{P_i})\right)}{\sum_{s \in M_P} \exp\left(\mathrm{LeakyReLU}(\Theta^{\top} s)\right)},$$
where $\Theta \in \mathbb{R}^{d}$ denotes the learnable parameter.
As demonstrated by Yang et al. [25], neighbor attention mechanisms within homogeneous relations are non-essential, where simple mean aggregation achieves comparable effectiveness to attention-based approaches. We therefore formulate an alternative intra-class metapath aggregation as
$$h_P = \sigma\left( \frac{1}{|M_P|} \sum_{h_{P_i} \in M_P} h_{P_i} \right).$$
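The two intra-class aggregation options can be compared side by side in a short sketch. This is an illustration under assumptions: the attention score is taken as the dot product of the parameter vector with each encoding, tanh stands in for the activation, and the three instance vectors are toy values.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x

def attention_aggregate(encodings, theta, activation=math.tanh):
    """Softmax-weighted sum of metapath encodings (attention variant)."""
    scores = [leaky_relu(dot(theta, h)) for h in encodings]
    top = max(scores)                                   # for numerical stability
    exps = [math.exp(s - top) for s in scores]
    alphas = [e / sum(exps) for e in exps]
    dim = len(encodings[0])
    pooled = [sum(a * h[j] for a, h in zip(alphas, encodings)) for j in range(dim)]
    return [activation(p) for p in pooled], alphas

def mean_aggregate(encodings, activation=math.tanh):
    """Plain mean of metapath encodings (the simpler alternative)."""
    dim = len(encodings[0])
    pooled = [sum(h[j] for h in encodings) / len(encodings) for j in range(dim)]
    return [activation(p) for p in pooled]

instances = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]        # toy encodings
attn_out, alphas = attention_aggregate(instances, theta=[2.0, 0.0])
mean_out = mean_aggregate(instances)
uniform_out, uniform_alphas = attention_aggregate(instances, theta=[0.0, 0.0])
```

When the attention parameter is zero, every weight equals 1/m and the attention variant reduces to the mean, so mean aggregation can be viewed as the parameter-free special case.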
Finally, we employ an inter-class metapath aggregation layer to integrate semantic information revealed by all metapaths. Unlike MAGNN's batch-wise additive attention mechanism, we implement sample-wise multiplicative self-attention to aggregate metapath vectors. This design ensures batch independence while enabling adaptive attention weights per sample, thereby enhancing model robustness. Given encoded metapaths $H_P = \{h^{(P_1)}, h^{(P_2)}, \ldots, h^{(P_M)}\}$, the inter-class aggregation is computed through
$$\begin{aligned} X &= \tanh(W_1 h_P + b_1), \\ Q &= X W_Q, \quad K = X W_K, \quad V = X W_V, \\ \beta &= W_2 \, \mathrm{softmax}\left( \frac{Q K^{\top}}{\sqrt{d_k}} \right) V, \\ h &= \sum_{i=1}^{M} \beta_i \cdot h_v^{(P_i)}, \end{aligned}$$
where $X \in \mathbb{R}^{d}$ denotes the activated output from the linear transformation of $h_v^{P}$, $Q, K, V \in \mathbb{R}^{d_k}$ are linear projections of $X$, and $W_2 \in \mathbb{R}^{d_k \times 1}$ transforms attention weights $\beta \in \mathbb{R}^{M}$ for all metapaths.
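One possible reading of the sample-wise aggregation is sketched below for a single sample with M metapath vectors stacked as rows. It is an interpretation, not the reference implementation: vectors are treated as rows (so the first projection is written as a right-multiplication), the final d_k-by-1 projection is applied on the right so that the shapes agree, and all weights are toy values.

```python
import math

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    top = max(row)
    exps = [math.exp(x - top) for x in row]
    return [e / sum(exps) for e in exps]

def inter_class_aggregate(H, W1, b1, WQ, WK, WV, W2):
    """H holds M aggregated metapath vectors (rows) for ONE sample."""
    X = [[math.tanh(z + b) for z, b in zip(row, b1)] for row in matmul(H, W1)]
    Q, K, V = matmul(X, WQ), matmul(X, WK), matmul(X, WV)
    d_k = len(Q[0])
    K_t = [list(col) for col in zip(*K)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(Q, K_t)]
    A = [softmax(row) for row in scores]                 # M x M self-attention
    beta = [row[0] for row in matmul(matmul(A, V), W2)]  # one weight per metapath
    dim = len(H[0])
    h = [sum(b * vec[j] for b, vec in zip(beta, H)) for j in range(dim)]
    return beta, h

eye = [[1.0, 0.0], [0.0, 1.0]]                           # toy weights
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]                 # M = 3 metapaths, d = 2
beta, h = inter_class_aggregate(H, eye, [0.0, 0.0], eye, eye, eye, [[1.0], [1.0]])
```

Nothing in the function depends on other samples in a batch, which is the batch-independence property the text contrasts with batch-wise additive attention.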

5.5. Link Prediction Decoder

For a specific link $(u, v)$, we generate node-related metapaths $P(u)$ and $P(v)$, along with their common metapath $P(u, v)$. Through HMPNN's pre-encoding and full-node metapath encoding, we obtain encoded representations $h_u$, $h_v$, and $h_{uv}$ for both nodes and their link. The node encodings $h_u$ and $h_v$ are secondary encodings derived from pre-encoded features, which capture semantic interaction information between homogeneous nodes since the metapaths encoding these nodes connect homogeneous nodes at both ends. The link encoding $h_{uv}$ originates from common metapaths between nodes $u$ and $v$, meaning these nodes are interconnected through intermediate nodes that form multiple communication pathways, constituting the most critical structural features of the connection.
Figure 7 presents a simplified schematic of the topology of the previously constructed heterogeneous biological entity graph. The figure was drawn by selecting a specific disease as the source node and a specific miRNA as the target node and pruning most connections in the real dataset so that only a few remain. Two potentially related nodes are connected through a pathway composed of multiple intermediate nodes. The source node connects to several hub nodes (e.g., mRNA-type nodes), while most satellite nodes linked to the target node associate with these hubs. All intermediate nodes provide structural insights for the source-target connection. Therefore, intermediate nodes along common metapaths serve as crucial structural features whose embeddings couple the link's semantic features. This integration of structural and semantic features provides direct evidence for determining node connectivity. For node pairs with missing connections, we complete their metapaths using the nodes themselves, which aligns with biological intuition.
Based on the theoretical framework, we design a specialized decoder for link prediction:
$$y = f\left( f_1(h_u \odot h_v) + \beta \cdot f_2(h_{uv}) \right),$$
where $f$, $f_1$, and $f_2$ denote multilayer perceptrons (MLPs) containing linear layers, layer normalization, and activation functions. The operator $\odot$ represents the Hadamard product, and $\beta$ is a learnable parameter. This decoder enhances link prediction performance by combining primary insights from node-level embeddings ($h_u \odot h_v$) with higher-order structural and semantic insights ($h_{uv}$).
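The decoder can be sketched with each MLP reduced to a single linear layer. This simplification is an assumption made for brevity: layer normalization is omitted, ReLU and a final sigmoid are stand-ins for the unspecified activations, and the parameter dictionary keys and toy values are hypothetical.

```python
import math

def linear(W, b, x):
    """Affine map W x + b with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def relu(v):
    return [max(0.0, t) for t in v]

def decode(h_u, h_v, h_uv, params, beta):
    """Score a candidate link from node embeddings and the common-metapath embedding."""
    hadamard = [a * b for a, b in zip(h_u, h_v)]                     # node-level term
    node_term = relu(linear(params["W1"], params["b1"], hadamard))   # stands in for f1
    path_term = relu(linear(params["W2"], params["b2"], h_uv))       # stands in for f2
    fused = [n + beta * p for n, p in zip(node_term, path_term)]
    logit = linear(params["Wf"], params["bf"], fused)[0]             # stands in for f
    return 1.0 / (1.0 + math.exp(-logit))                            # assumed sigmoid output

eye = [[1.0, 0.0], [0.0, 1.0]]
params = {"W1": eye, "b1": [0.0, 0.0], "W2": eye, "b2": [0.0, 0.0],
          "Wf": [[1.0, 1.0]], "bf": [0.0]}                           # toy parameters
h_u, h_v, h_uv = [1.0, 2.0], [0.5, 0.5], [1.0, 0.0]
y_nodes_only = decode(h_u, h_v, h_uv, params, beta=0.0)
y_with_paths = decode(h_u, h_v, h_uv, params, beta=1.0)
```

Setting the weight on the common-metapath term to zero recovers a decoder that sees only the two node embeddings, which mirrors the CoupleMDA-link ablation in spirit.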

6. Conclusions

This study addresses key limitations in current MDA prediction methodologies. First, by abandoning GIP similarity metrics and establishing strict data isolation, we resolve the persistent self-validation loop issue in prior research that artificially inflated performance metrics. Second, the proposed metapath-guided structural-semantic coupling mechanism simultaneously models multi-entity regulatory semantics and topological features, overcoming the information bottleneck in conventional GNN architectures. Third, our structure-aware decoding strategy enhances prediction accuracy and interpretability by dynamically weighting heterogeneous biological evidence through learnable metapath attention coefficients. Experimental validations across multiple scenarios demonstrate CoupleMDA’s robustness. The framework’s biological plausibility is confirmed through its ability to identify prostate cancer-associated miRNAs via drug-mediated pathways. These advancements not only establish new state-of-the-art performance in MDA prediction but also provide a blueprint for addressing feature-target coupling challenges in broader biomedical relation prediction tasks. Future work will extend this paradigm to multi-omics integration and temporal association modeling.

Author Contributions

Z.L.: Methodology, software and data curation, writing—original draft preparation. G.C.: methodology, software, writing—original draft preparation. G.T.: methodology, software, writing—original draft preparation. C.Y.-C.C.: conceptualization of this study, methodology, software. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (Grant No. 62176272), the Research and Development Program of the Guangzhou Science and Technology Bureau (No. 2023B01J1016), and the Key-Area Research and Development Program of Guangdong Province (No. 2020B1111100001).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The source code and data of this study are available at https://github.com/lizhj39/CoupleMDA (accessed on 17 March 2025).

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this report.

Appendix A

Appendix A.1

Table A1. Comparative performance analysis of heterogeneous graph neural networks (mean ± std%). All metrics were averaged over five independent runs. AUC (area under curve), AP (average precision), and Acc (accuracy).
| Model | Subdataset | AUC | AP | Precision | Recall | F1 Score | Acc |
|---|---|---|---|---|---|---|---|
| GATNE | Train | 76.84 ± 0.62 | 78.47 ± 1.23 | 57.53 ± 1.32 | 99.33 ± 0.31 | 73.18 ± 1.34 | 59.47 ± 1.03 |
| GATNE | Val | 76.12 ± 0.57 | 77.82 ± 1.17 | 56.79 ± 1.26 | 99.78 ± 0.29 | 72.51 ± 1.23 | 58.83 ± 0.92 |
| GATNE | Test | 75.70 ± 0.64 | 77.32 ± 1.25 | 56.31 ± 1.37 | 99.66 ± 0.33 | 71.96 ± 1.36 | 61.35 ± 0.51 |
| HetGNN | Train | 91.23 ± 0.32 | 90.78 ± 0.51 | 85.52 ± 0.57 | 79.82 ± 1.02 | 82.53 ± 0.16 | 83.03 ± 0.82 |
| HetGNN | Val | 90.53 ± 0.34 | 90.14 ± 0.56 | 84.63 ± 0.59 | 79.23 ± 1.09 | 81.79 ± 0.17 | 82.32 ± 0.73 |
| HetGNN | Test | 90.06 ± 0.35 | 89.60 ± 0.58 | 84.21 ± 0.59 | 78.46 ± 1.08 | 81.24 ± 0.17 | 81.88 ± 0.60 |
| Simple-HGN | Train | 95.38 ± 0.81 | 94.17 ± 1.03 | 83.48 ± 2.52 | 95.47 ± 1.12 | 88.97 ± 1.92 | 89.82 ± 0.91 |
| Simple-HGN | Val | 94.62 ± 0.83 | 93.53 ± 1.06 | 82.57 ± 2.46 | 95.83 ± 1.21 | 88.31 ± 1.86 | 89.13 ± 0.81 |
| Simple-HGN | Test | 94.16 ± 0.86 | 93.02 ± 1.07 | 82.10 ± 2.57 | 94.32 ± 1.19 | 87.79 ± 1.99 | 87.48 ± 0.77 |
| HAN | Train | 95.21 ± 0.91 | 93.48 ± 0.36 | 88.03 ± 1.36 | 89.53 ± 0.46 | 88.82 ± 0.36 | 88.63 ± 0.61 |
| HAN | Val | 94.43 ± 0.87 | 92.81 ± 0.41 | 87.24 ± 1.31 | 89.81 ± 0.52 | 87.93 ± 0.40 | 87.82 ± 0.51 |
| HAN | Test | 93.95 ± 0.97 | 92.15 ± 0.38 | 86.71 ± 1.44 | 88.25 ± 0.51 | 87.48 ± 0.39 | 87.02 ± 0.94 |
| MAGNN | Train | 95.79 ± 0.63 | 95.19 ± 0.39 | 88.12 ± 1.03 | 93.04 ± 0.66 | 90.51 ± 1.27 | 89.84 ± 0.72 |
| MAGNN | Val | 95.13 ± 0.61 | 94.52 ± 0.43 | 87.31 ± 1.06 | 92.51 ± 0.71 | 89.72 ± 1.31 | 89.15 ± 0.61 |
| MAGNN | Test | 94.63 ± 0.67 | 93.97 ± 0.41 | 86.78 ± 1.09 | 91.79 ± 0.72 | 89.21 ± 1.33 | 89.24 ± 0.59 |
| NCNC | Train | 96.27 ± 0.52 | 95.72 ± 0.39 | 86.83 ± 1.01 | 96.03 ± 1.16 | 90.62 ± 1.21 | 90.29 ± 0.62 |
| NCNC | Val | 95.53 ± 0.53 | 95.04 ± 0.41 | 86.05 ± 1.06 | 95.84 ± 1.22 | 90.04 ± 1.26 | 89.63 ± 0.52 |
| NCNC | Test | 94.96 ± 0.55 | 94.42 ± 0.42 | 85.47 ± 1.05 | 94.62 ± 1.21 | 89.30 ± 1.27 | 89.28 ± 0.23 |
| HGCNMDA | Train | 95.12 ± 0.16 | 94.03 ± 0.56 | 90.82 ± 0.16 | 89.83 ± 0.71 | 89.51 ± 0.29 | 89.53 ± 0.51 |
| HGCNMDA | Val | 94.31 ± 0.19 | 93.32 ± 0.59 | 90.13 ± 0.19 | 90.21 ± 0.76 | 89.32 ± 0.31 | 89.22 ± 0.41 |
| HGCNMDA | Test | 93.92 ± 0.18 | 92.71 ± 0.58 | 89.64 ± 0.18 | 88.35 ± 0.76 | 88.99 ± 0.31 | 88.99 ± 0.20 |
| MHGTMDA | Train | 95.88 ± 0.29 | 94.81 ± 0.71 | 89.93 ± 0.41 | 91.53 ± 1.31 | 90.63 ± 0.51 | 90.03 ± 0.52 |
| MHGTMDA | Val | 95.21 ± 0.31 | 94.13 ± 0.76 | 89.14 ± 0.43 | 91.04 ± 1.36 | 89.81 ± 0.57 | 89.34 ± 0.41 |
| MHGTMDA | Test | 94.64 ± 0.31 | 93.51 ± 0.77 | 88.63 ± 0.43 | 89.97 ± 1.39 | 89.29 ± 0.56 | 88.71 ± 0.30 |
| CoupleMDA | Train | 96.13 ± 0.31 | 95.63 ± 0.29 | 91.82 ± 1.61 | 93.52 ± 1.51 | 91.53 ± 0.56 | 91.23 ± 0.51 |
| CoupleMDA | Val | 95.72 ± 0.33 | 95.21 ± 0.31 | 91.03 ± 1.66 | 92.83 ± 1.56 | 90.72 ± 0.59 | 90.42 ± 0.41 |
| CoupleMDA | Test | 95.36 ± 0.35 | 94.84 ± 0.32 | 90.51 ± 1.69 | 91.92 ± 1.55 | 90.21 ± 0.58 | 90.64 ± 0.69 |
Table A2. Experimental results of GATNE model on different runs in the test set. (TP: true positives, TN: true negatives, FP: false positives, FN: false negatives).
| Exp | TP | TN | FP | FN | Precision (%) | Recall (%) | F1 Score (%) | Acc (%) |
|---|---|---|---|---|---|---|---|---|
| 1 | 1513 | 352 | 1171 | 10 | 56.37 | 99.34 | 71.93 | 61.23 |
| 2 | 1522 | 339 | 1184 | 1 | 56.25 | 99.93 | 71.98 | 61.10 |
| 3 | 1515 | 369 | 1154 | 8 | 56.76 | 99.47 | 72.28 | 61.85 |
| 4 | 1518 | 330 | 1193 | 5 | 55.99 | 99.67 | 71.71 | 60.67 |
| 5 | 1520 | 365 | 1158 | 3 | 56.76 | 99.80 | 72.36 | 61.88 |
Table A3. Experimental results of HetGNN model on different runs in the test set.
| Exp | TP | TN | FP | FN | Precision (%) | Recall (%) | F1 Score (%) | Acc (%) |
|---|---|---|---|---|---|---|---|---|
| 1 | 1184 | 1315 | 208 | 339 | 85.06 | 77.74 | 81.23 | 82.04 |
| 2 | 1195 | 1299 | 224 | 328 | 84.21 | 78.46 | 81.24 | 81.88 |
| 3 | 1189 | 1289 | 234 | 334 | 83.56 | 78.07 | 80.72 | 81.35 |
| 4 | 1193 | 1302 | 221 | 330 | 84.37 | 78.33 | 81.24 | 81.91 |
| 5 | 1206 | 1280 | 243 | 317 | 83.23 | 79.19 | 81.16 | 81.62 |
Table A4. Experimental results of Simple-HGN model on different runs in the test set.
| Exp | TP | TN | FP | FN | Precision (%) | Recall (%) | F1 Score (%) | Acc (%) |
|---|---|---|---|---|---|---|---|---|
| 1 | 1422 | 1257 | 266 | 101 | 84.24 | 93.37 | 88.57 | 87.95 |
| 2 | 1454 | 1225 | 298 | 69 | 82.99 | 95.47 | 88.79 | 87.95 |
| 3 | 1402 | 1284 | 239 | 121 | 85.44 | 92.06 | 88.62 | 88.18 |
| 4 | 1438 | 1197 | 326 | 85 | 81.52 | 94.42 | 87.50 | 86.51 |
| 5 | 1431 | 1213 | 310 | 92 | 82.19 | 93.96 | 87.68 | 86.80 |
Table A5. Experimental results of HAN model on different runs in the test set.
| Exp | TP | TN | FP | FN | Precision (%) | Recall (%) | F1 Score (%) | Acc (%) |
|---|---|---|---|---|---|---|---|---|
| 1 | 1344 | 1317 | 206 | 179 | 86.71 | 88.25 | 87.47 | 87.36 |
| 2 | 1337 | 1278 | 245 | 186 | 84.51 | 87.79 | 86.12 | 85.85 |
| 3 | 1360 | 1324 | 199 | 163 | 87.24 | 89.30 | 88.25 | 88.12 |
| 4 | 1324 | 1341 | 182 | 199 | 87.92 | 86.93 | 87.42 | 87.49 |
| 5 | 1329 | 1297 | 226 | 194 | 85.47 | 87.26 | 86.35 | 86.21 |
Table A6. Experimental results of MAGNN model on different runs in the test set.
| Exp | TP | TN | FP | FN | Precision (%) | Recall (%) | F1 Score (%) | Acc (%) |
|---|---|---|---|---|---|---|---|---|
| 1 | 1427 | 1313 | 210 | 96 | 87.17 | 93.70 | 90.32 | 89.95 |
| 2 | 1398 | 1328 | 195 | 125 | 87.76 | 91.79 | 89.73 | 89.49 |
| 3 | 1392 | 1300 | 223 | 131 | 86.19 | 91.40 | 88.72 | 88.38 |
| 4 | 1423 | 1287 | 236 | 100 | 85.77 | 93.43 | 89.44 | 88.97 |
| 5 | 1383 | 1341 | 182 | 140 | 88.37 | 90.81 | 89.57 | 89.43 |
Table A7. Experimental results of NCNC model on different runs in the test set.
| Exp | TP | TN | FP | FN | Precision (%) | Recall (%) | F1 Score (%) | Acc (%) |
|---|---|---|---|---|---|---|---|---|
| 1 | 1449 | 1274 | 249 | 74 | 85.34 | 95.14 | 89.97 | 89.40 |
| 2 | 1423 | 1292 | 231 | 100 | 86.03 | 93.43 | 89.58 | 89.13 |
| 3 | 1468 | 1245 | 278 | 55 | 84.08 | 96.39 | 89.81 | 89.07 |
| 4 | 1419 | 1311 | 212 | 104 | 87.00 | 93.17 | 89.98 | 89.63 |
| 5 | 1434 | 1282 | 241 | 89 | 85.61 | 94.16 | 89.68 | 89.17 |
Table A8. Experimental results of HGCNMDA model on different runs in the test set.
| Exp | TP | TN | FP | FN | Precision (%) | Recall (%) | F1 Score (%) | Acc (%) |
|---|---|---|---|---|---|---|---|---|
| 1 | 1344 | 1368 | 155 | 179 | 89.66 | 88.25 | 88.95 | 89.03 |
| 2 | 1346 | 1365 | 158 | 177 | 89.49 | 88.38 | 88.93 | 89.00 |
| 3 | 1343 | 1366 | 157 | 180 | 89.53 | 88.18 | 88.85 | 88.94 |
| 4 | 1349 | 1366 | 157 | 174 | 89.58 | 88.58 | 89.07 | 89.13 |
| 5 | 1342 | 1364 | 159 | 181 | 89.41 | 88.12 | 88.76 | 88.84 |
Table A9. Experimental results of MHGTMDA model on different runs in the test set.
| Exp | TP | TN | FP | FN | Precision (%) | Recall (%) | F1 Score (%) | Acc (%) |
|---|---|---|---|---|---|---|---|---|
| 1 | 1370 | 1349 | 174 | 153 | 88.73 | 89.95 | 89.34 | 89.26 |
| 2 | 1375 | 1349 | 174 | 148 | 88.77 | 90.28 | 89.52 | 89.43 |
| 3 | 1378 | 1344 | 179 | 145 | 88.50 | 90.48 | 89.48 | 89.36 |
| 4 | 1374 | 1347 | 176 | 149 | 88.65 | 90.22 | 89.42 | 89.33 |
| 5 | 1360 | 1349 | 174 | 163 | 88.66 | 89.30 | 88.98 | 88.94 |
Table A10. Experimental results of CoupleMDA model on different runs in the test set.
| Exp | TP | TN | FP | FN | Precision (%) | Recall (%) | F1 Score (%) | Acc (%) |
|---|---|---|---|---|---|---|---|---|
| 1 | 1398 | 1356 | 167 | 125 | 89.33 | 91.79 | 90.54 | 90.41 |
| 2 | 1415 | 1336 | 187 | 108 | 88.33 | 92.91 | 90.56 | 90.32 |
| 3 | 1413 | 1379 | 144 | 110 | 90.75 | 92.78 | 91.75 | 91.66 |
| 4 | 1383 | 1388 | 135 | 140 | 91.11 | 90.81 | 90.96 | 90.97 |
| 5 | 1369 | 1368 | 155 | 154 | 89.83 | 89.89 | 89.86 | 89.86 |

Appendix A.2

Table A11. Performance comparison of CoupleMDA variants with different architectures (mean ± std%). Performance metrics are reported as mean ± standard deviation over five runs. AUC (area under curve), AP (average precision), Acc (accuracy).
| Model Variant | Subdataset | AUC | AP | Precision | Recall | F1 Score | Acc |
|---|---|---|---|---|---|---|---|
| CoupleMDA-link | Train | 95.12 ± 0.38 | 94.72 ± 0.28 | 86.34 ± 0.42 | 91.85 ± 0.55 | 89.01 ± 0.42 | 89.12 ± 0.62 |
| CoupleMDA-link | Val | 95.47 ± 0.35 | 95.02 ± 0.30 | 85.83 ± 0.40 | 94.28 ± 0.50 | 89.83 ± 0.38 | 89.55 ± 0.58 |
| CoupleMDA-link | Test | 94.84 ± 0.59 | 94.31 ± 0.31 | 84.70 ± 0.40 | 94.16 ± 0.51 | 89.18 ± 0.96 | 88.15 ± 1.10 |
| CoupleMDA-attn | Train | 94.68 ± 0.65 | 94.15 ± 0.72 | 87.12 ± 1.08 | 91.23 ± 0.58 | 89.08 ± 1.25 | 88.92 ± 0.82 |
| CoupleMDA-attn | Val | 95.03 ± 0.62 | 94.67 ± 0.68 | 86.95 ± 1.05 | 93.88 ± 0.55 | 89.73 ± 1.20 | 89.35 ± 0.78 |
| CoupleMDA-attn | Test | 94.77 ± 0.72 | 94.21 ± 0.79 | 87.03 ± 1.06 | 91.66 ± 0.56 | 89.29 ± 1.34 | 89.03 ± 0.59 |
| CoupleMDA+iattn | Train | 95.23 ± 0.41 | 94.82 ± 0.38 | 87.45 ± 1.05 | 92.98 ± 0.65 | 89.82 ± 0.50 | 89.34 ± 0.70 |
| CoupleMDA+iattn | Val | 95.63 ± 0.38 | 95.24 ± 0.35 | 87.02 ± 1.02 | 94.42 ± 0.62 | 90.32 ± 0.47 | 89.92 ± 0.65 |
| CoupleMDA+iattn | Test | 95.16 ± 0.47 | 94.64 ± 0.42 | 86.90 ± 1.07 | 93.92 ± 0.69 | 89.34 ± 0.46 | 89.27 ± 0.60 |
| CoupleMDA-iattn+self_super | Train | 94.85 ± 0.68 | 94.63 ± 0.12 | 87.52 ± 1.15 | 92.05 ± 0.80 | 89.58 ± 1.22 | 88.95 ± 0.75 |
| CoupleMDA-iattn+self_super | Val | 95.22 ± 0.64 | 95.01 ± 0.10 | 87.25 ± 1.12 | 93.85 ± 0.78 | 90.12 ± 1.18 | 89.62 ± 0.72 |
| CoupleMDA-iattn+self_super | Test | 94.82 ± 0.75 | 94.57 ± 0.09 | 87.17 ± 1.15 | 92.32 ± 0.85 | 89.67 ± 1.28 | 89.49 ± 0.70 |
| CoupleMDA-HMPNN | Train | 94.93 ± 0.52 | 94.42 ± 0.60 | 86.85 ± 1.18 | 91.28 ± 0.33 | 88.92 ± 0.28 | 88.45 ± 0.55 |
| CoupleMDA-HMPNN | Val | 95.31 ± 0.49 | 94.88 ± 0.56 | 86.63 ± 1.15 | 92.84 ± 0.30 | 89.45 ± 0.25 | 89.12 ± 0.52 |
| CoupleMDA-HMPNN | Test | 94.96 ± 0.57 | 94.20 ± 0.68 | 86.49 ± 1.18 | 91.66 ± 0.31 | 89.00 ± 0.25 | 88.58 ± 0.50 |
| CoupleMDA | Train | 95.47 ± 0.30 | 94.91 ± 0.25 | 87.19 ± 1.65 | 92.11 ± 1.50 | 89.58 ± 0.55 | 89.29 ± 0.55 |
| CoupleMDA | Val | 96.05 ± 0.28 | 95.33 ± 0.27 | 86.95 ± 1.60 | 94.22 ± 1.48 | 90.35 ± 0.53 | 89.38 ± 0.52 |
| CoupleMDA | Test | 95.35 ± 0.45 | 94.84 ± 0.32 | 90.51 ± 1.69 | 91.92 ± 1.55 | 90.21 ± 0.58 | 89.93 ± 0.30 |
Table A12. Experimental results of CoupleMDA-link variant on different runs in the test set.
| Exp | TP | TN | FP | FN | Precision (%) | Recall (%) | F1 Score (%) | Acc (%) |
|---|---|---|---|---|---|---|---|---|
| 1 | 1419 | 1259 | 264 | 104 | 84.31 | 93.17 | 88.52 | 87.92 |
| 2 | 1434 | 1270 | 253 | 89 | 85.00 | 94.16 | 89.35 | 88.77 |
| 3 | 1445 | 1283 | 240 | 78 | 85.76 | 94.88 | 90.09 | 89.56 |
| 4 | 1410 | 1228 | 295 | 113 | 82.70 | 92.58 | 87.36 | 86.61 |
| 5 | 1423 | 1255 | 268 | 100 | 84.15 | 93.43 | 88.55 | 87.92 |
Table A13. Experimental results of CoupleMDA-attn variant on different runs in the test set.
| Exp | TP | TN | FP | FN | Precision (%) | Recall (%) | F1 Score (%) | Acc (%) |
|---|---|---|---|---|---|---|---|---|
| 1 | 1387 | 1297 | 226 | 136 | 85.99 | 91.07 | 88.46 | 88.12 |
| 2 | 1393 | 1315 | 208 | 130 | 87.01 | 91.46 | 89.18 | 88.90 |
| 3 | 1397 | 1323 | 200 | 126 | 87.48 | 91.73 | 89.55 | 89.30 |
| 4 | 1401 | 1332 | 191 | 122 | 88.00 | 91.99 | 89.95 | 89.72 |
| 5 | 1401 | 1314 | 209 | 122 | 87.02 | 91.99 | 89.44 | 89.13 |
Table A14. Experimental results of CoupleMDA+iattn variant on different runs in the test set.
| Exp | TP | TN | FP | FN | Precision (%) | Recall (%) | F1 Score (%) | Acc (%) |
|---|---|---|---|---|---|---|---|---|
| 1 | 1403 | 1309 | 214 | 120 | 86.77 | 92.12 | 89.36 | 89.03 |
| 2 | 1431 | 1324 | 199 | 92 | 87.79 | 93.96 | 90.77 | 90.45 |
| 3 | 1388 | 1304 | 219 | 135 | 86.37 | 91.14 | 88.69 | 88.38 |
| 4 | 1417 | 1334 | 189 | 106 | 88.23 | 93.04 | 90.57 | 90.32 |
| 5 | 1391 | 1295 | 228 | 132 | 85.92 | 91.33 | 88.54 | 88.18 |
Table A15. Experimental results of CoupleMDA+self_super variant on different runs in the test set.
| Exp | TP | TN | FP | FN | Precision (%) | Recall (%) | F1 Score (%) | Acc (%) |
|---|---|---|---|---|---|---|---|---|
| 1 | 1411 | 1316 | 207 | 112 | 87.21 | 92.65 | 89.84 | 89.53 |
| 2 | 1409 | 1328 | 195 | 114 | 87.84 | 92.51 | 90.12 | 89.86 |
| 3 | 1410 | 1304 | 219 | 113 | 86.56 | 92.58 | 89.47 | 89.10 |
| 4 | 1405 | 1341 | 182 | 118 | 88.53 | 92.25 | 90.35 | 90.15 |
| 5 | 1415 | 1290 | 233 | 108 | 85.86 | 92.91 | 89.25 | 88.80 |
Table A16. Experimental results of CoupleMDA-HMPNN variant on different runs in the test set.
| Exp | TP | TN | FP | FN | Precision (%) | Recall (%) | F1 Score (%) | Acc (%) |
|---|---|---|---|---|---|---|---|---|
| 1 | 1396 | 1305 | 218 | 127 | 86.49 | 91.66 | 89.00 | 88.67 |
| 2 | 1395 | 1314 | 209 | 128 | 86.97 | 91.60 | 89.22 | 88.94 |
| 3 | 1391 | 1284 | 239 | 132 | 85.34 | 91.33 | 88.23 | 87.82 |
| 4 | 1400 | 1299 | 224 | 123 | 86.21 | 91.92 | 88.97 | 88.61 |
| 5 | 1396 | 1308 | 215 | 127 | 86.65 | 91.66 | 89.09 | 88.77 |

References

  1. Bartel, D.P. MicroRNAs: Genomics, Biogenesis, Mechanism, and Function. Cell 2004, 116, 281–297.
  2. Ambros, V. microRNAs: Tiny Regulators with Great Potential. Cell 2001, 107, 823–826.
  3. Ambros, V. MicroRNA Pathways in Flies and Worms: Growth, Death, Fat, Stress, and Timing. Cell 2003, 113, 673–676.
  4. Jiang, H.; Moro, A.; Wang, J.; Meng, D.; Zhan, X.; Wei, Q. MicroRNA-338-3p as a Novel Therapeutic Target for Intervertebral Disc Degeneration. Exp. Mol. Med. 2021, 53, 1356–1365.
  5. Moradifard, S.; Hoseinbeyki, M.; Ganji, S.M.; Minuchehr, Z. Analysis of microRNA and Gene Expression Profiles in Alzheimer’s Disease: A Meta-Analysis Approach. Sci. Rep. 2018, 8, 4767.
  6. Yan, W.; Wu, X.; Zhou, W.; Fong, M.Y.; Cao, M.; Liu, J.; Liu, X.; Chen, C.H.; Fadare, O.; Pizzo, D.P.; et al. Cancer-Cell-Secreted Exosomal miR-105 Promotes Tumour Growth through the MYC-Dependent Metabolic Reprogramming of Stromal Cells. Nat. Cell Biol. 2018, 20, 597–609.
  7. He, X.Y.; Liao, Y.D.; Guo, X.Q.; Wang, R.; Xiao, Z.Y.; Wang, Y.G. Prognostic Role of microRNA-21 Expression in Brain Tumors: A Meta-Analysis. Mol. Neurobiol. 2016, 53, 1856–1861.
  8. Chen, X.; Ba, Y.; Ma, L.; Cai, X.; Yin, Y.; Wang, K.; Guo, J.; Zhang, Y.; Chen, J.; Guo, X.; et al. Characterization of microRNAs in Serum: A Novel Class of Biomarkers for Diagnosis of Cancer and Other Diseases. Cell Res. 2008, 18, 997–1006.
  9. Chen, X.; Xie, D.; Zhao, Q.; You, Z.H. MicroRNAs and Complex Diseases: From Experimental Results to Computational Models. Briefings Bioinform. 2019, 20, 515–539.
  10. Huang, L.; Zhang, L.; Chen, X. Updated Review of Advances in microRNAs and Complex Diseases: Taxonomy, Trends and Challenges of Computational Models. Briefings Bioinform. 2022, 23, bbac358.
  11. Zhang, Y.; Chu, Y.; Lin, S.; Xiong, Y.; Wei, D.Q. ReHoGCNES-MDA: Prediction of miRNA-Disease Associations Using Homogenous Graph Convolutional Networks Based on Regular Graph with Random Edge Sampler. Briefings Bioinform. 2024, 25, bbae103.
  12. Zhao, H.; Li, Z.; You, Z.H.; Nie, R.; Zhong, T. Predicting miRNA-Disease Associations Based on Neighbor Selection Graph Attention Networks. IEEE/ACM Trans. Comput. Biol. Bioinform. 2023, 20, 1298–1307.
  13. Momanyi, B.M.; Zulfiqar, H.; Grace-Mercure, B.K.; Ahmed, Z.; Ding, H.; Gao, H.; Liu, F. CFNCM: Collaborative Filtering Neighborhood-Based Model for Predicting miRNA-Disease Associations. Comput. Biol. Med. 2023, 163, 107165.
  14. Sujamol, S.; Vimina, E.R.; Krishnakumar, U. Improving miRNA Disease Association Prediction Accuracy Using Integrated Similarity Information and Deep Autoencoders. IEEE/ACM Trans. Comput. Biol. Bioinform. 2023, 20, 1125–1136.
  15. Hu, H.; Zhao, H.; Zhong, T.; Dong, X.; Wang, L.; Han, P.; Li, Z. Adaptive Deep Propagation Graph Neural Network for Predicting miRNA–Disease Associations. Briefings Funct. Genom. 2023, 22, 453–462.
  16. Zou, H.; Ji, B.; Zhang, M.; Liu, F.; Xie, X.; Peng, S. MHGTMDA: Molecular Heterogeneous Graph Transformer Based on Biological Entity Graph for miRNA-Disease Associations Prediction. Mol. Ther.-Nucleic Acids 2024, 35, 102139.
  17. Ji, B.; Zou, H.; Xu, L.; Xie, X.; Peng, S. MUSCLE: Multi-View and Multi-Scale Attentional Feature Fusion for microRNA–Disease Associations Prediction. Briefings Bioinform. 2024, 25, bbae167.
  18. Zhou, F.; Yin, M.M.; Jiao, C.N.; Zhao, J.X.; Zheng, C.H.; Liu, J.X. Predicting miRNA–Disease Associations Through Deep Autoencoder with Multiple Kernel Learning. IEEE Trans. Neural Netw. Learn. Syst. 2023, 34, 5570–5579.
  19. Jiao, C.N.; Zhou, F.; Liu, B.M.; Zheng, C.H.; Liu, J.X.; Gao, Y.L. Multi-Kernel Graph Attention Deep Autoencoder for MiRNA-Disease Association Prediction. IEEE J. Biomed. Health Inform. 2023, 28, 1110–1121.
  20. Liu, J.; Kuang, Z.; Deng, L. GCNPCA: MiRNA-Disease Associations Prediction Algorithm Based on Graph Convolutional Neural Networks. IEEE/ACM Trans. Comput. Biol. Bioinform. 2023, 20, 1041–1052.
  21. Wang, D.; Wang, J.; Lu, M.; Song, F.; Cui, Q. Inferring the Human microRNA Functional Similarity and Functional Network Based on microRNA-Associated Diseases. Bioinformatics 2010, 26, 1644–1650.
  22. Xiao, Q.; Luo, J.; Liang, C.; Cai, J.; Ding, P. A Graph Regularized Non-Negative Matrix Factorization Method for Identifying microRNA-Disease Associations. Bioinformatics 2018, 34, 239–248.
  23. Ruan, X.; Jiang, C.; Lin, P.; Lin, Y.; Liu, J.; Huang, S.; Liu, X. MSGCL: Inferring miRNA–Disease Associations Based on Multi-View Self-Supervised Graph Structure Contrastive Learning. Briefings Bioinform. 2023, 24, bbac623.
  24. Peng, W.; Che, Z.; Dai, W.; Wei, S.; Lan, W. Predicting miRNA-Disease Associations from miRNA-Gene-Disease Heterogeneous Network with Multi-Relational Graph Convolutional Network Model. IEEE/ACM Trans. Comput. Biol. Bioinform. 2023, 20, 3363–3375.
  25. Yang, X.; Yan, M.; Pan, S.; Ye, X.; Fan, D. Simple and Efficient Heterogeneous Graph Neural Network. Proc. AAAI Conf. Artif. Intell. 2023, 37, 10816–10824.
  26. Fu, X.; Zhang, J.; Meng, Z.; King, I. MAGNN: Metapath Aggregated Graph Neural Network for Heterogeneous Graph Embedding. In Proceedings of the Web Conference 2020, Taipei, Taiwan, 20–24 April 2020; pp. 2331–2341.
  27. Dong, Y.; Chawla, N.V.; Swami, A. Metapath2vec: Scalable Representation Learning for Heterogeneous Networks. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, NS, Canada, 13–17 August 2017; pp. 135–144.
  28. Liu, W.; Higashikuni, Y.; Sata, M. Linking RNA Dynamics to Heart Disease: The lncRNA/miRNA/mRNA Axis in Myocardial Ischemia–Reperfusion Injury. Hypertens. Res. 2022, 45, 1067–1069.
  29. Li, Q.; Zhang, C.; Chen, R.; Xiong, H.; Qiu, F.; Liu, S.; Zhang, M.; Wang, F.; Wang, Y.; Zhou, X.; et al. Disrupting MALAT1/miR-200c Sponge Decreases Invasion and Migration in Endometrioid Endometrial Carcinoma. Cancer Lett. 2016, 383, 28–40.
  30. Ji, P.; Diederichs, S.; Wang, W.; Böing, S.; Metzger, R.; Schneider, P.M.; Tidow, N.; Brandt, B.; Buerger, H.; Bulk, E.; et al. MALAT-1, a Novel Noncoding RNA, and Thymosin β4 Predict Metastasis and Survival in Early-Stage Non-Small Cell Lung Cancer. Oncogene 2003, 22, 8031–8041.
  31. Kariuki, D.; Asam, K.; Aouizerat, B.E.; Lewis, K.A.; Florez, J.C.; Flowers, E. Review of Databases for Experimentally Validated Human microRNA–mRNA Interactions. Database 2023, 2023, baad014.
  32. Hill, M.; Tran, N. miRNA Interplay: Mechanisms and Consequences in Cancer. Dis. Model. Mech. 2021, 14, dmm047662.
  33. He, C.; Duan, L.; Zheng, H.; Li-Ling, J.; Song, L.; Li, L. Graph Convolutional Network Approach to Discovering Disease-Related circRNA-miRNA-mRNA Axes. Methods 2022, 198, 45–55.
  34. Newman, M.E.J. Clustering and Preferential Attachment in Growing Networks. Phys. Rev. E 2001, 64, 025102.
  35. Zhou, T.; Lü, L.; Zhang, Y.C. Predicting Missing Links via Local Information. Eur. Phys. J. B 2009, 71, 623–630.
  36. Li, J.; Shomer, H.; Mao, H.; Zeng, S.; Ma, Y.; Shah, N.; Tang, J.; Yin, D. Evaluating Graph Neural Networks for Link Prediction: Current Pitfalls and New Benchmarking. Adv. Neural Inf. Process. Syst. 2023, 36, 3853–3866.
  37. Tan, Q.; Zhang, X.; Liu, N.; Zha, D.; Li, L.; Chen, R.; Choi, S.H.; Hu, X. Bring Your Own View: Graph Neural Networks for Link Prediction with Personalized Subgraph Selection. In Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining, Singapore, 27 February–3 March 2023; pp. 625–633.
  38. Lu, Y.; Gao, M.; Liu, H.; Liu, Z.; Yu, W.; Li, X.; Jiao, P. Neighborhood Overlap-Aware Heterogeneous Hypergraph Neural Network for Link Prediction. Pattern Recognit. 2023, 144, 109818.
  39. He, Q.; Qiao, W.; Fang, H.; Bao, Y. Improving the Identification of miRNA–Disease Associations with Multi-Task Learning on Gene–Disease Networks. Briefings Bioinform. 2023, 24, bbad203.
  40. Hang, J.; Hong, Z.; Feng, X.; Wang, G.; Yang, G.; Li, F.; Song, X.; Zhang, D. Paths2Pair: Meta-Path Based Link Prediction in Billion-Scale Commercial Heterogeneous Graphs. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Barcelona, Spain, 25–29 August 2024; pp. 5082–5092.
  41. Shao, H.; Wang, L.; Zhu, R. Link Prediction for Heterogeneous Information Networks Based on Enhanced Meta-Path Aggregation and Attention Mechanism. Int. J. Mach. Learn. Cybern. 2023, 14, 3087–3103.
  42. Li, M.; Cai, X.; Xu, S.; Ji, H. Metapath-Aggregated Heterogeneous Graph Neural Network for Drug–Target Interaction Prediction. Briefings Bioinform. 2023, 24, bbac578.
  43. Mitra, A.; Vijayan, P.; Singh, S.R.; Goswami, D.; Parthasarathy, S.; Ravindran, B. Revisiting Link Prediction on Heterogeneous Graphs with a Multi-View Perspective. In Proceedings of the 2022 IEEE International Conference on Data Mining (ICDM), Orlando, FL, USA, 28 November–1 December 2022; pp. 358–367.
  44. Wang, H.; Mi, J.; Guo, X.; Hu, P. Meta-Learning Adaptation Network for Few-Shot Link Prediction in Heterogeneous Social Networks. Inf. Process. Manag. 2023, 60, 103418.
  45. Wang, H.; Cui, Z.; Liu, R.; Fang, L.; Sha, Y. A Multi-Type Transferable Method for Missing Link Prediction in Heterogeneous Social Networks. IEEE Trans. Knowl. Data Eng. 2023, 35, 10981–10991.
  46. Fan, H.; Zhang, F.; Wei, Y.; Li, Z.; Zou, C.; Gao, Y.; Dai, Q. Heterogeneous Hypergraph Variational Autoencoder for Link Prediction. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 4125–4138.
  47. Xu, S.; Yang, C.; Shi, C.; Fang, Y.; Guo, Y.; Yang, T.; Zhang, L.; Hu, M. Topic-Aware Heterogeneous Graph Neural Network for Link Prediction. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, Queensland, Australia, 1–5 November 2021; pp. 2261–2270.
  48. Wang, X.; Yang, H.; Zhang, M. Neural Common Neighbor with Completion for Link Prediction. arXiv 2024, arXiv:2302.00890.
  49. Cui, C.; Zhong, B.; Fan, R.; Cui, Q. HMDD v4.0: A Database for Experimentally Supported Human microRNA-Disease Associations. Nucleic Acids Res. 2024, 52, D1327–D1332.
  50. Szklarczyk, D.; Gable, A.L.; Lyon, D.; Junge, A.; Wyder, S.; Huerta-Cepas, J.; Simonovic, M.; Doncheva, N.T.; Morris, J.H.; Bork, P.; et al. STRING V11: Protein–Protein Association Networks with Increased Coverage, Supporting Functional Discovery in Genome-Wide Experimental Datasets. Nucleic Acids Res. 2019, 47, D607–D613.
  51. Fang, S.; Zhang, L.; Guo, J.; Niu, Y.; Wu, Y.; Li, H.; Zhao, L.; Li, X.; Teng, X.; Sun, X.; et al. NONCODEV5: A Comprehensive Annotation Database for Long Non-Coding RNAs. Nucleic Acids Res. 2018, 46, D308–D314.
  52. Glažar, P.; Papavasileiou, P.; Rajewsky, N. circBase: A Database for Circular RNAs. RNA 2014, 20, 1666–1670.
  53. Knox, C.; Wilson, M.; Klinger, C.M.; Franklin, M.; Oler, E.; Wilson, A.; Pon, A.; Cox, J.; Chin, N.E.L.; Strawbridge, S.A.; et al. DrugBank 6.0: The DrugBank Knowledgebase for 2024. Nucleic Acids Res. 2024, 52, D1265–D1275.
  54. Sherry, S.T.; Ward, M.H.; Kholodov, M.; Baker, J.; Phan, L.; Smigielski, E.M.; Sirotkin, K. dbSNP: The NCBI Database of Genetic Variation. Nucleic Acids Res. 2001, 29, 308–311.
  55. Lipscomb, C.E. Medical Subject Headings (MeSH). Bull. Med. Libr. Assoc. 2000, 88, 265–266.
  56. Yao, D.; Zhang, L.; Zheng, M.; Sun, X.; Lu, Y.; Liu, P. Circ2Disease: A Manually Curated Database of Experimentally Validated circRNAs in Human Disease. Sci. Rep. 2018, 8, 11018.
  57. Bhattacharya, A.; Cui, Y. SomamiR 2.0: A Database of Cancer Somatic Mutations Altering microRNA–ceRNA Interactions. Nucleic Acids Res. 2016, 44, D1005–D1010.
  58. Hu, Y.; Guo, X.; Yun, Y.; Lu, L.; Huang, X.; Jia, S. DisGeNet: A Disease-Centric Interaction Database among Diseases and Various Associated Genes. Database 2025, 2025, baae122.
  59. Ma, W.; Zhang, L.; Zeng, P.; Huang, C.; Li, J.; Geng, B.; Yang, J.; Kong, W.; Zhou, X.; Cui, Q. An Analysis of Human Microbe–Disease Associations. Briefings Bioinform. 2017, 18, 85–97.
  60. Zhang, W.; Yue, X.; Lin, W.; Wu, W.; Liu, R.; Huang, F.; Liu, F. Predicting Drug-Disease Associations by Using Similarity Constrained Matrix Factorization. BMC Bioinform. 2018, 19, 233.
  61. Sun, Y.Z.; Zhang, D.H.; Cai, S.B.; Ming, Z.; Li, J.Q.; Chen, X. MDAD: A Special Resource for Microbe-Drug Associations. Front. Cell. Infect. Microbiol. 2018, 8, 424.
  62. Altman, R.B. PharmGKB: A Logical Home for Knowledge Relating Genotype to Drug Response Phenotype. Nat. Genet. 2007, 39, 426.
  63. Lin, X.; Lu, Y.; Zhang, C.; Cui, Q.; Tang, Y.D.; Ji, X.; Cui, C. LncRNADisease v3.0: An Updated Database of Long Non-Coding RNA-associated Diseases. Nucleic Acids Res. 2024, 52, D1365–D1369.
  64. Feng, T.; Feng, N.; Zhu, T.; Li, Q.; Zhang, Q.; Wang, Y.; Gao, M.; Zhou, B.; Yu, H.; Zheng, M.; et al. A SNP-mediated lncRNA (LOC146880) and microRNA (miR-539-5p) Interaction and Its Potential Impact on the NSCLC Risk. J. Exp. Clin. Cancer Res. 2020, 39, 157.
  65. Cheng, L.; Wang, P.; Tian, R.; Wang, S.; Guo, Q.; Luo, M.; Zhou, W.; Liu, G.; Jiang, H.; Jiang, Q. LncRNA2Target v2.0: A Comprehensive Database for Target Genes of lncRNAs in Human and Mouse. Nucleic Acids Res. 2019, 47, D140–D144.
  66. Zheng, Y.; Luo, H.; Teng, X.; Hao, X.; Yan, X.; Tang, Y.; Zhang, W.; Wang, Y.; Zhang, P.; Li, Y.; et al. NPInter v5.0: ncRNA Interaction Database in a New Era. Nucleic Acids Res. 2023, 51, D232–D239.
  67. Liu, X.; Wang, S.; Meng, F.; Wang, J.; Zhang, Y.; Dai, E.; Yu, X.; Li, X.; Jiang, W. SM2miR: A Database of the Experimentally Validated Small Molecules’ Effects on microRNA Expression. Bioinformatics 2013, 29, 409–411.
  68. Huang, H.Y.; Lin, Y.C.D.; Li, J.; Huang, K.Y.; Shrestha, S.; Hong, H.C.; Tang, Y.; Chen, Y.G.; Jin, C.N.; Yu, Y.; et al. miRTarBase 2020: Updates to the Experimentally Validated microRNA–Target Interaction Database. Nucleic Acids Res. 2020, 48, D148–D154.
  69. Schlichtkrull, M.; Kipf, T.N.; Bloem, P.; van den Berg, R.; Titov, I.; Welling, M. Modeling Relational Data with Graph Convolutional Networks. In Proceedings of the Semantic Web, Crete, Greece, 3–7 June 2018; Springer: Cham, Switzerland, 2018; pp. 593–607.
  70. Hu, Z.; Dong, Y.; Wang, K.; Sun, Y. Heterogeneous Graph Transformer. In Proceedings of the Web Conference 2020, Taipei, Taiwan, 20–24 April 2020; pp. 2704–2710.
  71. Cen, Y.; Zou, X.; Zhang, J.; Yang, H.; Zhou, J.; Tang, J. Representation Learning for Attributed Multiplex Heterogeneous Network. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Anchorage, AK, USA, 4–8 August 2019; pp. 1358–1368.
  72. Zhang, C.; Song, D.; Huang, C.; Swami, A.; Chawla, N.V. Heterogeneous Graph Neural Network. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Anchorage, AK, USA, 4–8 August 2019; pp. 793–803.
  73. Lv, Q.; Ding, M.; Liu, Q.; Chen, Y.; Feng, W.; He, S.; Zhou, C.; Jiang, J.; Dong, Y.; Tang, J. Are We Really Making Much Progress? Revisiting, Benchmarking and Refining Heterogeneous Graph Neural Networks. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, Singapore, 14–18 August 2021; pp. 1150–1160.
  74. Wang, X.; Ji, H.; Shi, C.; Wang, B.; Ye, Y.; Cui, P.; Yu, P.S. Heterogeneous Graph Attention Network. In Proceedings of the World Wide Web Conference, San Francisco, CA, USA, 13–17 May 2019; pp. 2022–2032.
  75. Coradduzza, D.; Cruciani, S.; Arru, C.; Garroni, G.; Pashchenko, A.; Jedea, M.; Zappavigna, S.; Caraglia, M.; Amler, E.; Carru, C.; et al. Role of miRNA-145, 148, and 185 and Stem Cells in Prostate Cancer. Int. J. Mol. Sci. 2022, 23, 1626.
  76. Urabe, F.; Matsuzaki, J.; Yamamoto, Y.; Kimura, T.; Hara, T.; Ichikawa, M.; Takizawa, S.; Aoki, Y.; Niida, S.; Sakamoto, H.; et al. Large-Scale Circulating microRNA Profiling for the Liquid Biopsy of Prostate Cancer. Clin. Cancer Res. 2019, 25, 3016–3025.
  77. Zhang, M.; Chen, Y. Link Prediction Based on Graph Neural Networks. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2018; Volume 31.
  78. Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Liò, P.; Bengio, Y. Graph Attention Networks. arXiv 2018, arXiv:1710.10903.
Figure 1. CoupleMDA model framework. (a) Heterogeneous graph linking relationships of biological entities; (b) 13 common metapaths discovered from the heterogeneous graph; (c) heterogeneous graph message passing neural network RGCN; (d) metapath neighbor aggregation module; (e) common metapath as a structure feature guided decoder.
Figure 2. Experimental results of ROC and PRC curves for the MDA link prediction task on Zou’s dataset using CoupleMDA and baseline models: (a) ROC curve with dashed line y = x indicating random guessing; (b) PRC curve with dashed line y = 0.5 showing chance-level precision under balanced classes.
Figure 3. Impact of different components on CoupleMDA performance. Each bar represents performance of model variants: CoupleMDA-link (without common metapath guidance), CoupleMDA-attn (with batch-based attention), CoupleMDA+iattn (with inner attention aggregation), CoupleMDA-HMPNN (without pre-encoding), CoupleMDA+self_super (with self-supervision), and full CoupleMDA. Performance of the above model variants on datasets DBLP, LastFM and Amazon in terms of (a) AUC and (b) AP metrics.
Figure 4. Attention weight distribution. Heatmaps of the inter-class common metapath attention weights for the three datasets. Each column represents a metapath type; each row represents an individual sample.
Figure 5. Node embedding visualization. t-SNE visualization comparing node embeddings from the baseline models GAT (a) and HGCNMDA (b) with those from CoupleMDA (c) on Zou’s dataset.
Figure 6. Case study of miRNA prediction related to PCa. (a) Volcano plot of differential expression of all 1755 miRNAs, with red dots representing significantly upregulated RNAs and green dots representing significantly downregulated RNAs. (b) Validation results of the prediction in the dbDEMC database, where green nodes and edges indicate MDAs verified in the dbDEMC database, red nodes represent RNAs found to be significantly differentially expressed in the differential expression analysis of the GSE112264 dataset, and gray nodes and edges indicate MDAs without experimental validation. The “hsa-miR” prefix has been omitted from all miRNA names. (c) Top 50 differentially expressed miRNAs in PCa samples. Orange blocks represent low-expression RNAs, and purple blocks represent high-expression RNAs.
Figure 7. Simplified schematic diagram of the topological structure of the constructed heterogeneous graph of biological entities, with a specific disease as the source node and a specific miRNA as the target node.
Table 1. Data source of biological entity association data.
| Association | Database | Association | Database |
|---|---|---|---|
| circRNA-disease | Circ2Disease [56] | circRNA-disease | SomamiR [57] |
| disease-mRNA | DisGeNet [58] | disease-microbe | HMDAD [59] |
| drug-disease | SCMFDD [60] | drug-microbe | MDAD [61] |
| drug-mRNA | PharmGKB [62] | drug-protein | DrugBank [53] |
| lncRNA-disease | LncRNADisease [63] | lncRNA-miRNA | SNP [64] |
| lncRNA-mRNA | LncRNA2Target [65] | lncRNA-protein | NPInter [66] |
| miRNA-drug | SM2miR [67] | miRNA-mRNA | miRTarBase [68] |
| miRNA-protein | MHGTMDA [16] | mRNA-protein | MHGTMDA [16] |
Table 2. Experimental results (%) on the DBLP, LastFM, and Amazon datasets for the link prediction task. The best performance for each metric in the table is highlighted in bold.
| Models | DBLP AUC | DBLP AP | LastFM AUC | LastFM AP | Amazon AUC | Amazon AP |
|---|---|---|---|---|---|---|
| RGCN | 87.63 ± 0.14 | 89.92 ± 0.25 | 81.70 ± 0.39 | 86.27 ± 0.33 | 81.90 ± 0.54 | 77.89 ± 0.42 |
| HGT | 87.78 ± 0.23 | 89.14 ± 0.66 | 80.97 ± 0.54 | 83.41 ± 0.65 | 86.56 ± 4.07 | 84.12 ± 5.28 |
| GATNE | 75.63 ± 0.58 | 76.97 ± 0.37 | 86.62 ± 0.20 | 86.83 ± 0.21 | 96.58 ± 0.86 | 96.28 ± 0.41 |
| HetGNN | 84.31 ± 0.43 | 85.27 ± 0.59 | 88.16 ± 0.25 | 89.61 ± 0.32 | 95.71 ± 0.68 | 94.65 ± 0.82 |
| Simple-HGN | 87.89 ± 0.07 | 89.37 ± 0.08 | 84.73 ± 0.06 | 87.35 ± 0.05 | 97.57 ± 0.33 | 96.88 ± 0.46 |
| HAN | 88.15 ± 0.82 | 89.33 ± 1.17 | 88.16 ± 0.20 | 89.33 ± 0.18 | 97.16 ± 0.53 | 95.87 ± 0.74 |
| MAGNN | 89.76 ± 1.03 | 90.72 ± 1.12 | 88.89 ± 2.56 | 89.49 ± 1.75 | 98.43 ± 0.13 | 98.06 ± 0.21 |
| NCNC | 89.99 ± 1.44 | 91.81 ± 1.41 | 91.19 ± 0.34 | 91.56 ± 0.46 | 98.49 ± 0.19 | 98.35 ± 0.18 |
| CoupleMDA | **92.19 ± 0.10** | **93.04 ± 0.09** | **94.11 ± 0.13** | **94.64 ± 0.15** | **99.04 ± 0.16** | **98.89 ± 0.18** |
Table 3. Experiment results (%) on the test set of Zou’s dataset for the MDA link prediction task. The best performance for each metric in the table is highlighted in bold.
| Model | AUC | AP | Precision | Recall | F1 Score |
|---|---|---|---|---|---|
| GATNE | 75.70 ± 0.64 | 77.32 ± 1.25 | 56.31 ± 1.37 | **99.66 ± 0.33** | 71.96 ± 1.36 |
| HetGNN | 90.06 ± 0.35 | 89.60 ± 0.58 | 84.21 ± 0.59 | 78.46 ± 1.08 | 81.24 ± 0.17 |
| Simple-HGN | 94.16 ± 0.86 | 93.02 ± 1.07 | 82.10 ± 2.57 | 94.32 ± 1.19 | 87.79 ± 1.99 |
| HAN | 93.95 ± 0.97 | 92.15 ± 0.38 | 86.71 ± 1.44 | 88.25 ± 0.51 | 87.48 ± 0.39 |
| MAGNN | 94.63 ± 0.67 | 93.97 ± 0.41 | 86.78 ± 1.09 | 91.79 ± 0.72 | 89.21 ± 1.33 |
| NCNC | 94.96 ± 0.55 | 94.42 ± 0.42 | 85.47 ± 1.05 | 94.62 ± 1.21 | 89.30 ± 1.27 |
| HGCNMDA | 93.92 ± 0.18 | 92.71 ± 0.58 | 89.64 ± 0.18 | 88.35 ± 0.76 | 88.99 ± 0.31 |
| MHGTMDA | 94.64 ± 0.31 | 93.51 ± 0.77 | 88.63 ± 0.43 | 89.97 ± 1.39 | 89.29 ± 0.56 |
| CoupleMDA | **95.36 ± 0.35** | **94.84 ± 0.32** | **90.51 ± 1.69** | 91.92 ± 1.55 | **90.21 ± 0.58** |
Table 4. Quantitative results of the ablation study (%) on the test sets of the DBLP, LastFM, and Amazon datasets. The best performance for each metric in the table is highlighted in bold.
| Variant | DBLP AUC | DBLP AP | LastFM AUC | LastFM AP | Amazon AUC | Amazon AP |
|---|---|---|---|---|---|---|
| CoupleMDA-link | 88.84 ± 0.60 | 90.18 ± 1.76 | 86.95 ± 0.45 | 87.95 ± 0.50 | 93.57 ± 0.46 | 93.63 ± 0.71 |
| CoupleMDA-attn | 87.36 ± 0.96 | 88.48 ± 0.49 | 93.00 ± 2.59 | 93.32 ± 2.63 | 94.18 ± 0.81 | 93.99 ± 0.90 |
| CoupleMDA+iattn | 91.74 ± 0.29 | 92.54 ± 0.35 | 93.36 ± 0.21 | 93.81 ± 0.27 | 98.31 ± 0.16 | 98.07 ± 0.17 |
| CoupleMDA-HMPNN | 91.27 ± 0.16 | 92.41 ± 0.11 | 93.89 ± 0.10 | 94.48 ± 0.09 | 98.42 ± 0.15 | 98.27 ± 0.14 |
| CoupleMDA+self_super | 91.78 ± 0.19 | 92.67 ± 0.22 | 94.00 ± 0.15 | 94.36 ± 0.19 | 98.57 ± 0.14 | 98.42 ± 0.16 |
| CoupleMDA | **92.19 ± 0.20** | **93.04 ± 0.19** | **94.11 ± 0.13** | **94.64 ± 0.15** | **99.04 ± 0.16** | **98.89 ± 0.18** |
Table 5. Quantitative results of ablation study (%) on the test set of Zou’s dataset. The best performance for each metric in the table is highlighted in bold.
| Variant | AUC | AP | Precision | Recall | F1 Score |
|---|---|---|---|---|---|
| CoupleMDA-link | 94.84 ± 0.59 | 94.31 ± 0.31 | 84.70 ± 0.40 | **94.16 ± 0.51** | 89.18 ± 0.96 |
| CoupleMDA-attn | 94.77 ± 0.72 | 94.21 ± 0.79 | 87.03 ± 1.06 | 91.66 ± 0.56 | 89.29 ± 1.34 |
| CoupleMDA+iattn | 95.16 ± 0.47 | 94.64 ± 0.42 | 86.90 ± 1.07 | 93.92 ± 0.69 | 89.34 ± 0.46 |
| CoupleMDA+self_super | 94.82 ± 0.75 | 94.57 ± 0.09 | 87.17 ± 1.15 | 92.32 ± 0.85 | 89.67 ± 1.28 |
| CoupleMDA-HMPNN | 94.96 ± 0.57 | 94.20 ± 0.68 | 86.49 ± 1.18 | 91.66 ± 0.31 | 89.00 ± 0.25 |
| CoupleMDA | **95.35 ± 0.45** | **94.84 ± 0.32** | **90.51 ± 1.69** | 91.92 ± 1.55 | **90.21 ± 0.58** |
Table 6. Twelve node-specific metapaths and thirteen common metapaths discovered from heterogeneous graphs.
| Node-Specific Metapaths | Common Metapaths |
|---|---|
| miRNA–mRNA–miRNA | miRNA–drug–disease |
| miRNA–drug–miRNA | miRNA–mRNA–disease |
| miRNA–circRNA–miRNA | miRNA–lncRNA–disease |
| miRNA–lncRNA–miRNA | miRNA–circRNA–disease |
| miRNA–protein–miRNA | miRNA–disease–miRNA–disease |
| miRNA–disease–miRNA | miRNA–drug–mRNA–disease |
| disease–mRNA–disease | miRNA–drug–microbe–disease |
| disease–drug–disease | miRNA–mRNA–drug–disease |
| disease–circRNA–disease | miRNA–mRNA–lncRNA–disease |
| disease–lncRNA–disease | miRNA–protein–drug–disease |
| disease–microbe–disease | miRNA–protein–mRNA–disease |
| disease–miRNA–disease | miRNA–protein–lncRNA–disease |
| | miRNA–lncRNA–mRNA–disease |
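To make the notion of a common metapath in Table 6 concrete, the following minimal sketch enumerates the path instances that connect one miRNA to one disease along a given sequence of node types. It is an illustration only, not the authors' implementation: the toy graph, the node identifiers (`m1`, `g1`, `d1`, and so on), and the helper names `build_adjacency` and `metapath_instances` are all hypothetical.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Build an undirected adjacency map over (node_type, node_id) tuples."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def metapath_instances(adj, source, target, metapath):
    """Return every path from source to target whose node types follow metapath."""
    if source[0] != metapath[0] or target[0] != metapath[-1]:
        return []
    paths = [[source]]
    # Extend each partial path by neighbours of the required next type.
    for node_type in metapath[1:]:
        paths = [
            path + [nbr]
            for path in paths
            for nbr in sorted(adj[path[-1]])
            if nbr[0] == node_type
        ]
    return [path for path in paths if path[-1] == target]


# Hypothetical toy graph: one miRNA, two mRNAs, one drug, two diseases.
edges = [
    (("miRNA", "m1"), ("mRNA", "g1")),
    (("miRNA", "m1"), ("mRNA", "g2")),
    (("mRNA", "g1"), ("disease", "d1")),
    (("mRNA", "g2"), ("disease", "d1")),
    (("miRNA", "m1"), ("drug", "c1")),
    (("drug", "c1"), ("disease", "d2")),
]
adj = build_adjacency(edges)

via_mrna = metapath_instances(
    adj, ("miRNA", "m1"), ("disease", "d1"), ["miRNA", "mRNA", "disease"]
)
via_drug = metapath_instances(
    adj, ("miRNA", "m1"), ("disease", "d1"), ["miRNA", "drug", "disease"]
)
print(len(via_mrna), len(via_drug))  # 2 0
```

In this toy graph the pair (m1, d1) is linked by two miRNA–mRNA–disease instances and by no miRNA–drug–disease instance, so only the first metapath would supply structural evidence for that pair.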

Share and Cite

MDPI and ACS Style

Li, Z.; Chen, G.; Tan, G.; Chen, C.Y.-C. CoupleMDA: Metapath-Induced Structural-Semantic Coupling Network for miRNA-Disease Association Prediction. Int. J. Mol. Sci. 2025, 26, 4948. https://doi.org/10.3390/ijms26104948

AMA Style

Li Z, Chen G, Tan G, Chen CY-C. CoupleMDA: Metapath-Induced Structural-Semantic Coupling Network for miRNA-Disease Association Prediction. International Journal of Molecular Sciences. 2025; 26(10):4948. https://doi.org/10.3390/ijms26104948

Chicago/Turabian Style

Li, Zhuojian, Guanxing Chen, Guang Tan, and Calvin Yu-Chian Chen. 2025. "CoupleMDA: Metapath-Induced Structural-Semantic Coupling Network for miRNA-Disease Association Prediction" International Journal of Molecular Sciences 26, no. 10: 4948. https://doi.org/10.3390/ijms26104948

APA Style

Li, Z., Chen, G., Tan, G., & Chen, C. Y.-C. (2025). CoupleMDA: Metapath-Induced Structural-Semantic Coupling Network for miRNA-Disease Association Prediction. International Journal of Molecular Sciences, 26(10), 4948. https://doi.org/10.3390/ijms26104948
