1. Introduction
MicroRNAs (miRNAs), a class of approximately 22-nucleotide non-coding RNA molecules, regulate key biological processes including cell proliferation, differentiation, and apoptosis through targeted gene expression modulation [
1,
2,
3]. Their mechanisms of action encompass mRNA degradation induction, translational repression, and epigenetic regulation [
4]. Studies have demonstrated that aberrant miRNA expression correlates with various diseases: for instance, downregulation of miR-103/107 associates with amyloid-beta deposition in Alzheimer’s disease [
5], while miR-105 shows elevated expression in breast cancer [
6]. These findings underscore miRNAs’ potential as disease biomarkers [
7].
Conventional wet-lab methods for identifying MDAs face critical bottlenecks, including prolonged experimental cycles and high per-test costs [
8]. This has driven the development of bioinformatics approaches based on the functional association hypothesis, which posits that phenotypically similar diseases are associated with functionally related miRNAs. Capitalizing on the strong representation learning capabilities of deep learning in graph-structured data processing, GNN-based frameworks have emerged as cutting-edge solutions for discovering potential MDAs in complex biological networks [
9,
10]. Among these, heterogeneous graph attention network variants that learn topological features through node embedding and neighborhood aggregation have shown particular promise. These methods generally follow two operational phases: (1) data acquisition and construction of miRNA-disease similarity matrices and (2) graph structure construction with GNN-based neighbor feature aggregation, followed by decoding through feed-forward neural networks.
Current methodologies for constructing miRNA similarity matrices exhibit significant methodological homogeneity and operational risks of data leakage. First, the similarity matrices derived from miRNA sequences and gene interaction networks are inherently sparse, necessitating supplementation through either filling [
11,
12,
13,
14,
15] or concatenation [
16,
17] with Gaussian interaction profile (GIP) kernel similarity matrices. Analysis of methodology descriptions and open-source implementations in references [
11,
12,
13,
14,
15,
16,
17] reveals that these approaches typically compute GIP-based similarities using the complete known MDA matrix to generate embeddings, yet directly employ these embeddings in cross-validation without masking test samples. This operational flaw allows positive test samples to participate in similarity computation as prior knowledge. Ablation experiments in MHGTMDA [
16] demonstrate that removing node embeddings generated from unmasked GIP data causes predictive AUC to plummet from 96% to 74%, confirming substantial performance inflation from data leakage. Notably, most open-source implementations store fixed miRNA similarity matrices for subsequent cross-validation, inherently introducing data leakage. Our experimental results indicate that under such operational protocols, even simple multilayer perceptrons (MLPs) achieve inflated performance metrics, which fail to reflect models’ true predictive capabilities. Although DAEMKL [
18] and MGADAE [
19] circumvent GIP-induced leakage through autoencoder architectures, the intrinsic parameter sensitivity of GIP methods and their heavy reliance on known network structures hinder models’ ability to capture nonlinear high-order relationships, frequently resulting in overfitting or underfitting.
Secondly, existing studies [
11,
12,
13,
14,
18,
19,
20] predominantly utilize the miRNA functional similarity matrix MISIM developed by Wang et al. (2010) [
21], where outdated data represents only a secondary limitation compared with its fundamental methodological flaws. The core issue lies in circular logic: MISIM calculates miRNA functional similarity using pre-existing MDA data, while subsequent research directly incorporates this similarity matrix as input features for MDA prediction models. This approach creates tight coupling between feature and target spaces, where similarity metrics derived from known associations are used to predict homologous relationships, essentially forming a self-validation loop. Such closed-loop designs may lead to over-optimistic model evaluation, significantly compromising real-world generalization capability. Although GRNMF [
22] proposed inferring miRNA similarity through gene functional interaction networks (later adopted by MSGCL [
23]) to avoid circular reasoning, its reliance on set-wise maximum similarity matching constitutes a manually designed heuristic algorithm capable of capturing only shallow linear relationships.
Consequently, constructing a robust miRNA similarity matrix faces inherent challenges due to deficient data sources: resultant matrices inevitably suffer from either (1) sparsity issues, (2) data leakage with self-validation loops, or (3) limited expressiveness and weak generalization. Similar fundamental constraints equally apply to disease similarity matrix construction.
Various molecular entities, such as proteins, genes, and lncRNAs, provide multi-level biological insights from complementary perspectives. Leveraging additional biological entities as mediators connecting miRNAs and diseases not only addresses data incompleteness in single-molecule analyses but also enhances predictive models with enriched input features [
16,
24]. HGCNMDA [
24] introduces a gene layer to construct a miRNA-gene-disease heterogeneous network, learning feature embeddings through GNN models. MUSCLE [
17] establishes three heterogeneous networks using drugs, mRNAs, and lncRNAs as mediators between miRNAs and diseases, with feature embeddings learned via graph attention neural networks. MHGTMDA [
16] integrates eight biological entities (including lncRNAs, circRNAs, and proteins) to build a heterogeneous biological graph containing 16 association types, employing heterogeneous graph transformers for feature encoding. Although these methods construct relatively comprehensive graph data structures, they exhibit three critical limitations: (1) The models learn node representations by aggregating multi-hop neighborhood features based solely on edge-type relationships, failing to capture structural information between links through features like common neighbors; (2) The absence of metapath-guided neighbor aggregation prevents intuitive interpretation of captured semantic relationships; (3) Over-simplified decoders relying on basic Hadamard products inadequately process neighbor embeddings. This architectural design results in significant information loss during both encoding and decoding processes.
The prediction of MDAs inherently constitutes a link prediction problem in heterogeneous information networks, where model architecture design must synergistically integrate network heterogeneity and topological characteristics of link prediction, a critical factor enabling high-performance prediction in this study. Metapath-based heterogeneous GNN methods overcome the limitation of flattening heterogeneous relationships in conventional approaches through self-consistent semantic aggregation strategies that capture structural information within metapath subgraphs and fuse multi-semantic features [
25,
26]. A metapath, defined as an ordered sequence of node and edge types, effectively characterizes multi-level semantic relationships in heterogeneous graphs [
27]. In biological entity heterogeneous graphs, numerous biologically meaningful metapaths exist. For instance, the “miRNA → lncRNA → mRNA → disease” metapath reflects the biological semantics where miRNAs regulate disease pathways through ceRNA mechanisms: miRNAs competitively bind with long non-coding RNAs (lncRNAs), thereby releasing their target mRNAs from suppression [
28]. A concrete example involves MALAT1 sequestering the miR-200 family in lung cancer, which elevates ZEB1 mRNA expression and promotes epithelial–mesenchymal transition (EMT) [
29,
30]. Other representative metapaths include the following:
miRNA → mRNA → disease: direct miRNA regulation of aberrant gene expression induces pathogenesis [
31].
miRNA → drug → disease: therapeutic drugs exert effects by modulating miRNA expression [
32].
miRNA → circRNA → mRNA → disease: the circRNA-miRNA-mRNA axis plays crucial regulatory roles in diseases [
33].
These heterogeneous metapaths connecting node pairs of different types establish informational bridges between biological entities, enabling models to capture rich biological semantics while enhancing logical consistency and interpretability.
Recent advances in link prediction research have enhanced the representation of heterogeneous relationships by integrating structural features (SF) into GNNs. These SFs include common neighbors [
34] and resource allocation [
35], which enable joint semantic-topological modeling and improve heterogeneous relationship characterization [
36]. However, current methods predominantly employ high-order encoders to process high-dimensional semantic and topological node features while relying on simplistic Hadamard products for node representation reconstruction in decoders. This imbalanced design creates an information bottleneck during encoding compression, ultimately leading to unavoidable information dissipation in the decoding phase [
37,
38].
To address these limitations, we propose CoupleMDA—a robust heterogeneous graph link prediction framework for biological entities that simultaneously captures structure-semantics coupled features of heterogeneous connections between miRNA and disease nodes. Specifically, we leverage the large-scale, heterogeneous graph of eight biological entities constructed by Zou [
16] to mine metapaths and develop our model. Our framework autonomously learns the topological structure of heterogeneous graphs and infers target edge connections without over-reliance on heuristic algorithms with human-defined rules or biased prior knowledge. During data preprocessing, we strictly implement dataset partitioning and metapath generation to eliminate data leakage and self-validation loops. The proposed methodology operates through three phases, as shown in
Figure 1:
- (1)
Primary encoding: employ RGCN to pre-encode all nodes in the original heterogeneous graph, establishing comprehensive preliminary topological relationships.
- (2)
Semantic augmentation: perform secondary encoding by fusing metapath semantic features on target edge nodes, enriching node embeddings with homogeneous neighborhood semantics.
- (3)
Structural decoding: guide structural decoding using common metapaths (CM) between target node pairs.
The intrinsic semantic information of metapaths combined with structural features from common metapaths enhances the framework’s coupling of graph structure and biological semantics. Guided by this design, we conduct extensive experiments to evaluate CoupleMDA’s superior performance against state-of-the-art baselines.
The contributions of our work are fourfold:
- (1)
We propose a novel framework for MDA prediction that automatically identifies diverse metapath weights (e.g., miRNA → lncRNA → mRNA → disease) from biological heterogeneous graphs, capturing cross-entity regulatory semantics to enhance biological interpretability.
- (2)
A new decoding mechanism incorporating common metapath structural features through attention-based dynamic weighting of multi-semantic pathways, enabling joint semantic-topological decoding to mitigate information bottlenecks.
- (3)
We abandon conventional similarity matrix construction methods (e.g., MISIM) that rely on known MDA data, instead utilizing non-associated data sources to eliminate feature-target space coupling and resolve self-validation loops.
- (4)
Extensive experiments demonstrate CoupleMDA’s significant outperformance over non-leakage state-of-the-art baselines across evaluation metrics, confirming its effectiveness and generalization capability in diverse scenarios.
3. Preliminary
In this section, we present key concepts and formal definitions related to heterogeneous graph link prediction.
Definition 1 (Heterogeneous Graph)
. A heterogeneous graph is formally defined as , where is the set of nodes, each belonging to a predefined node type set . is the set of edges, each belonging to a predefined edge type set . is the node type mapping function, and is the edge type mapping function. This definition characterizes the diversity of nodes and edges in heterogeneous graphs.
Definition 2 (Metapath)
. A metapath P is formally defined as a relation pattern on a heterogeneous graph: , where is a sequence of node types in the heterogeneous graph, and is a sequence of edge types connecting adjacent node types. Metapaths reflect complex semantic associations between different types of nodes in heterogeneous graphs.
Definition 3 (Link Prediction)
. The goal of link prediction is to predict edges that do not yet exist in the graph. Given a pair of nodes , we aim to predict whether an edge will form between them. We formalize this task as , where is the predicted value of whether an edge exists between nodes and . This definition transforms the link prediction problem into a binary classification task based on node and relation types.
4. Results
This section evaluates the effectiveness of CoupleMDA through extensive experiments. Specifically, we first introduce the experimental setup, including datasets, evaluation metrics, and baseline methods. Subsequently, we provide a detailed analysis of the experimental results and validate the contributions of individual model components through ablation studies.
4.1. Datasets
To comprehensively evaluate the model’s performance on heterogeneous graph link prediction tasks, we first tested it on three representative public heterogeneous graph datasets. Then, we applied CoupleMDA to a previously constructed biological heterogeneous information network for MDA prediction testing.
The three heterogeneous graph datasets mentioned above are DBLP, LastFM, and Amazon. These datasets all use versions organized by HGB [
31]. The specific details of each dataset are as follows:
DBLP: This dataset is a subset of an academic network, which, after cleaning, contains rich academic elements: it covers 4057 researchers and their 14,328 published academic papers, along with 7723 professional terms and 20 types of academic publishing institutions.
LastFM: This is a music social network dataset used to track and analyze users’ music listening behaviors. It continuously collects cross-platform music playback behavior feature data from users. After information cleaning and standardization, the constructed graph dataset covers multiple dimensions: 1892 listeners, 17,632 music creator accounts, and 1088 associated genre tags.
Amazon: This dataset comes from user behavior data on an e-commerce platform. It includes product attributes and co-browsing and co-purchasing links between products. Product attributes include price, sales ranking, brand, and category.
We constructed our biological heterogeneous information network based on the dataset from Zou [
16]. The training data were derived from the MDA database in HMDDv3.2, from which we selected 901 miRNAs and 877 diseases to build an adjacency matrix. To enhance the dataset, we integrated updated MDA data from HMDDv4.0 [
49] and merged diseases with semantic similarities from the HMDD database. We added 75% of the MDAs as the training set to the graph. Other biological entities incorporated include 3348 protein nodes from the STRING database [
50], 2633 lncRNA nodes from NONCODEV5 [
51], 421 circRNA nodes from CircBase [
52], 1319 drug nodes from DrugBank [
53], 3024 mRNA nodes from the NCBI database [
54], and 100 microbial nodes from the NIH Medical Subject Headings (MeSH) database [
55]. Finally, an equivalent number of non-MDAs were randomly selected as negative controls. The data on associations between biological entities were collected from existing public databases and compiled by Zou et al. [
16]. The relevant database sources are listed in
Table 1.
4.2. Comparison Methods
To comprehensively evaluate the performance of CoupleMDA, we selected representative heterogeneous graph neural network models and MDA prediction models as baselines. These baseline models can be categorized into three classes:
4.2.1. General Heterogeneous Graph Neural Networks
RGCN [
69]: This model extends traditional graph convolutional network (GCN) to heterogeneous graph scenarios by performing convolutional operations separately on different types of edges, followed by weighted aggregation. Its core innovation lies in introducing specific weight matrices for each relation type, effectively handling multi-type edge relationships in heterogeneous graphs.
HGT [
70]: This model introduces a heterogeneous graph transformer architecture, processing different types of nodes and edges through a heterogeneous attention mechanism. Its key feature is leveraging multi-type edge information as weights for message passing and capturing long-range dependencies via a metapath-guided transformer structure.
GATNE [
71]: This method learns three distinct embeddings for each node: general embeddings, edge-type-specific embeddings, and attribute-enhanced embeddings. By fusing these embeddings, the model captures both global node characteristics and relation-specific features.
HetGNN [
72]: This model employs type-aware random walk strategies to sample heterogeneous neighbors and processes heterogeneous information through a two-layer aggregation mechanism. The first layer aggregates nodes of the same type, while the second layer fuses features across different types, followed by end-to-end optimization of node representations.
Simple-HGN [
73]: This model enhances graph attention networks (GAT) performance by introducing three key components: learnable type embeddings, residual connections, and L2 normalization of output embeddings. These simple yet effective modifications significantly improve performance on heterogeneous graphs.
HAN [
74]: This model designs hierarchical attention mechanisms, including node-level attention and semantic-level attention. Node-level attention aggregates information within individual metapaths, whereas semantic-level attention integrates semantic information from multiple metapaths.
MAGNN [
26]: The primary innovation of this model lies in preserving and leveraging intermediate node information in metapaths. By designing a metapath instance encoder, the model captures complete path semantics rather than relying solely on path endpoints.
4.2.2. MDA Prediction Models
HGCNMDA [
24]: This method introduces a gene layer to construct a miRNA-gene-disease heterogeneous network. It employs a multi-relational GCN to encode representations for given miRNAs, diseases, or genes, integrating neighbor representations according to distinct relation types.
MHGTMDA [
16]: This approach constructs a heterogeneous biological entity graph containing eight biomolecule types to comprehensively model indirect associations between miRNAs and diseases. The model processes diverse node and edge types through a heterogeneous attention mechanism, utilizing multi-type edge information as message-passing weights and capturing long-range dependencies via a metapath-guided transformer architecture.
4.2.3. Link Prediction Models
NCNC [
48]: This model proposes an MPNN-then-SF framework and introduces the concept of neural common neighbors. By combining message passing with structural features, it significantly improves link prediction performance. Additionally, the model incorporates mechanisms to address graph incompleteness, thereby enhancing prediction generalizability.
4.3. Experimental Setup
For data partitioning, we randomly divided all existing edges (positive samples) in the dataset into training, validation, and test sets with a ratio of 75:10:15. Negative samples were sampled from node pairs that did not have any existing edges in the graph at a 1:1 ratio. To prevent information leakage, the training graph only contained positive edges from the training set, while positive edges in the validation and test sets were removed from the graph during training. We use the results on the test set as the final criterion for evaluating the performance of the model.
Model performance was evaluated using the receiver operating characteristic area under curve (ROC-AUC) and average precision (AP) scores. Additional core classification metrics included precision, recall, and F1 score:
where
denotes true positives,
represents false positives, and
indicates false negatives.
The AUC metric measures the probability that a classifier ranks random positive samples higher than random negative samples:
where
M and
N denote the numbers of positive and negative samples,
represents the prediction score of the
i-th positive sample,
indicates the prediction score of the
j-th negative sample, and
is an indicator function that returns 1 when the condition is satisfied. The AP metric approximates the area under the precision–recall curve using the trapezoidal rule:
where
and
represent the recall and precision at the
k-th threshold.
For experimental configuration, all models used the same five random seeds for dataset partitioning and training, with five independent experimental repetitions. Each different random seed will divide different training and validation sets, but the fixed test set remains invisible to the model before final evaluation. We employed the Adam optimizer with weight decay for parameter optimization and tuned hyperparameters based on validation set performance. All baseline implementations were adapted from their official codebases, with modifications made to their data loading interfaces and downstream decoders. Experiments were conducted on an NVIDIA RTX 4090D GPU with 24 GB of memory.
4.4. Performance Analysis
This section first presents the experimental results of different models on heterogeneous graph link prediction tasks using the DBLP, LastFM, and Amazon datasets to demonstrate the superiority of the CoupleMDA model on general datasets. It then shows the experimental results of different models on MDA prediction tasks using the previously constructed biological heterogeneous information network to verify the robustness of CoupleMDA on specialized datasets.
The experimental results for link prediction on general heterogeneous graph datasets are shown in
Table 2. Experiments demonstrate that CoupleMDA consistently outperforms other baselines across different datasets. Compared with the best baseline (NCNC), CoupleMDA achieves a performance improvement of approximately 2–5%, indicating that co-metapaths, as structural information guiding GNN pooling, can maximize information capture. From the performance of different types of methods, metapath-based approaches generally outperform relation-based methods, suggesting that metapaths better capture semantic information in heterogeneous graphs. Further analysis reveals that CoupleMDA performs exceptionally well on real-world datasets with well-structured graph data. This is because, for graphs with fewer isolated edges, the more complete the neighborhood connections of the target edge, the better the model can leverage the coupled advantages of structure and semantics to deliver more accurate predictions.
The experimental results for MDA link prediction on the test set are shown in
Table 3 and
Figure 2. The performance of different models on the training set and validation set, as well as detailed results on the test set, are provided in
Appendix A.1. Specifically,
Table A1 records the average metrics of five experiments for different baseline models on the training, validation, and test sets of the Zou dataset.
Table A2,
Table A3,
Table A4,
Table A5,
Table A6,
Table A7,
Table A8,
Table A9,
Table A10 each record the TP, TN, FP, and FN values for each baseline model on the test set in each experiment. Notably, HGCNMDA and MHGTMDA were evaluated using their models on our reconstructed dataset after removing data that could cause information leakage.
The results demonstrate that CoupleMDA consistently outperforms other baselines across different datasets, achieving approximately 0.5% improvement over the strongest baseline (NCNC). CoupleMDA exhibits superior performance on real-world datasets with well-structured graphs because it can better leverage the coupling advantages of structural and semantic features when target edges have comprehensive neighborhood connections. However, Zou’s dataset contains numerous isolated nodes where certain nodes have significantly more connections in specific categories than others, which obscures structural features and limits CoupleMDA’s potential. Despite this limitation, CoupleMDA still effectively captures semantic information and basic connectivity patterns, outperforming other state-of-the-art models. Crucially, methods that incorporate structural information during decoding (e.g., NCNC and CoupleMDA) surpass approaches that focus solely on encoding node embeddings. This observation indicates that considering structural features during decoding is essential, as processing node features in isolation may neglect edge-specific structural correlations.
4.5. Ablation Experiment
To investigate the contributions of different components, we conducted ablation experiments on various CoupleMDA variants.
Table 4 and
Figure 3 present the ablation results of different variants on the DBLP, LastFM, and Amazon datasets.
Table 5 shows the ablation results of different variants on Zou’s dataset. Model variants’ performance across training/validation/test sets on the Zou dataset (averaged over five trials) is detailed in
Appendix A.2 (
Table A11), with per-trial test-set TP/TN/FP/FN metrics in
Table A12,
Table A13,
Table A14,
Table A15,
Table A16. Specifically, we designed the following variants:
CoupleMDA-link: removes the co-metapath guided decoding mechanism that incorporates structural features.
CoupleMDA-attn: replaces the inter-metapath self-attention with batch-based additive attention.
CoupleMDA+iattn: substitutes the simple mean aggregation with intra-metapath attention.
CoupleMDA+self_super: incorporates self-supervised learning for node-type and metapath-type classification.
CoupleMDA-HMPNN: removes the HMPNN pre-encoding mechanism.
The results reveal performance degradation across all variants when specific components are removed or self-supervised learning is added. CoupleMDA’s superiority primarily stems from its co-metapath mechanism that couples semantic and structural information. The inter-metapath self-attention aggregation demonstrates more robust performance than batch-based additive attention, as it remains unaffected by batch configurations or sample ordering. Other components contribute marginally to the performance, while the addition of self-supervised learning fails to improve results. Although HMPNN provides limited enhancement to final prediction accuracy, it significantly accelerates model convergence during training.
4.6. Visualization Analysis
To investigate the internal mechanisms of our model, we visualize its intermediate results.
4.6.1. Inter-Class Common Metapath Aggregation Weights
We extract the self-attention weights for inter-class common metapath aggregation, as shown in the heatmap of
Figure 4. Each sample receives independent attention weights, where higher values indicate greater contribution of specific metapaths to model decisions. We group the weights for positive and negative samples separately.
For positive samples, the weights exhibit a skewed distribution, suggesting that model decisions primarily rely on specific common metapaths. Notably, the miRNA-mRNA-disease metapath provides the highest contribution, followed by miRNA-drug-disease and miRNA-circRNA-disease metapaths, while the miRNA-disease-miRNA-disease metapath shows minimal contribution. This pattern aligns with biological reality since the miRNA-mRNA-disease metapath contains richer connectivity in heterogeneous biological graphs. Our analysis indicates that effective common metapaths expand the receptive field of common neighbors, enabling the model to perceive broader topological structures without losing critical connection details.
In contrast, negative samples in
Figure 4 display uniformly distributed weights, suggesting either insufficient structural information capture by the model or the absence of common metapaths for these unconnected node pairs, which consequently receive lower link prediction probabilities.
4.6.2. Encoded Vector Visualization
We qualitatively evaluate the node representations learned by CoupleMDA against baseline models GAT and HGCNMDA through t-SNE visualization of their final-layer encoded vectors on Zou’s dataset. As shown in
Figure 5, CoupleMDA-generated embeddings demonstrate significantly clearer cluster structures, exhibiting compact intra-class clustering and distinct inter-class boundaries in the 2D projection space. Different categories are separated by substantial margins, indicating superior discriminative power compared with the more overlapping distributions produced by baseline models.
4.7. Biomedical Case Study
To systematically evaluate the biomedical utility of CoupleMDA, we first performed differential expression analysis of miRNAs between prostate cancer (PCa) tissues and benign prostate tissues and then investigated the correlation between model predictions and differentially expressed miRNAs through statistical methods. Prostate cancer remains one of the most frequently diagnosed malignancies in males, where miRNAs serve as crucial biomarkers with specific expression profiles and therapeutic potential [
75].
The differential expression data were obtained from the GSE112264 dataset [
76], containing 809 PCa tissue samples and 241 benign prostate tissue samples. Through volcano plot visualization, we identified 1755 differentially expressed miRNAs (
Figure 6a), with the top 50 most significant expression shifts displayed in a heatmap (37 upregulated and 13 downregulated,
Figure 6c). We selected PCa-associated miRNAs using established thresholds: |log
2 FC| > 1 with adjusted
p-value < 0.05. miRNAs meeting these criteria were classified as significantly differentially expressed, while others were labeled non-differentially expressed. This selection protocol integrated twofold expression changes and rigorous statistical validation (FDR < 5%) to ensure reliability.
CoupleMDA achieved prediction accuracies of AUC = 91% and AP = 85% when evaluating associations between all 1755 miRNAs and PCa. We validated the top 42 model-predicted PCa-related miRNAs against the dbDEMC database, as illustrated in
Figure 6b. Among these, twenty-seven miRNAs received experimental support from existing literature (dbDEMC-EXP00020, EXP00032, EXP00042, EXP00044, EXP00045, EXP00046, EXP00403, EXP00469, EXP00639, EXP00640), while nine miRNAs overlapped with significantly differentially expressed RNAs identified in the GSE112264 analysis. This case study demonstrates the practical value and effectiveness of the proposed CoupleMDA method.
5. Methodology
This section provides a detailed description of the proposed CoupleMDA framework. The model represents a novel structure-semantics-coupled metapath aggregation approach for MDA link prediction in biological heterogeneous graphs. CoupleMDA consists of three core components: (1) metapath discovery, (2) an RGCN-based node pre-encoding module [
69], and (3) a secondary encoding module based on metapath aggregation and a common metapath decoding module.
Figure 1 illustrates these primary components and their workflow.
5.1. Overview of Model Framework
In link prediction, SF represented by common neighbors and path information are the core reference basis. The MPNN-then-SF architecture proposed by Wang et al. [
48] adopts a two-stage design: first executing an MPNN on the original graph, then leveraging structural features to guide the pooling of MPNN-derived features. Inspired by this framework, our model generalizes the MPNN-then-SF approach from homogeneous to heterogeneous graphs, as illustrated in
Figure 1.
Figure 1a displays the connection types in the constructed biological heterogeneous information network. Each category of biological entity contains multiple specific instance nodes, where each node connects to others according to the relationship types shown in
Figure 1a. For example, the miRNA-type node “hsa-mir-122-5p” is linked to the mRNA-type node “nm_003045.5” in the database.
The workflow of CoupleMDA consists of four stages:
Stage 1 (
Figure 1b): Discover all metapath types of lengths 2 and 3 from the biological heterogeneous information network. For each metapath type, recursively traverse all nodes to identify all node-level instances and store them.
Stage 2 (
Figure 1c): learn node representations by aggregating multi-hop neighborhood features based on heterogeneous edge types in the original graph, enabling node embeddings to capture preliminary structural relationships.
Stage 3 (
Figure 1d): for each target node pair, generate two node-specific embeddings by aggregating their individual metapaths and one common metapath embedding by aggregating common metapaths between the pair.
Stage 4 (
Figure 1e): Input the Hadamard product of the two node embeddings and the weighted common metapath embedding into a decoder to produce the link prediction. Subsequent sections will elaborate on the core components of the model.
This framework offers two key advantages: First, the common metapath serves as a dominant structural feature to guide link prediction. Second, the pre-encoded node embeddings from the RGCN encapsulate preliminary connectivity and interaction patterns in the graph, thereby enhancing downstream structural feature embeddings. The strong coupling between RGCN and SF ensures high expressiveness for link prediction tasks. Furthermore, the RGCN-based pre-encoding addresses incomplete metapath issues caused by missing links [
41].
5.2. MetaPath Discovery
The constructed biological heterogeneous graph comprises eight node types and sixteen edge types, as visually detailed in
Figure 1a and
Table 6. During data preprocessing, we identify all applicable metapaths for model learning. Prior to metapath discovery, the dataset is strictly partitioned: only miRNA-disease edges from the training set are retained in the original graph, while validation and test set edges are removed.
We define two metapath categories:
Node-specific metapaths (P(v)): paths starting from target node v and ending with nodes of the same type (example: miRNA-circRNA-miRNA (MCM)).
Common metapaths (P(v, u)): paths connecting heterogeneous node pairs (v and u), serving as bridges between different entity types (example: miRNA-lncRNA-mRNA-disease (MLRD)).
Node-specific metapaths (e.g., MCM) capture semantic interactions between homogeneous nodes, whereas common metapaths (e.g., MLRD) establish interpretable bridges between heterogeneous nodes and guide structural pooling.
We discovered 12 node-specific and 13 common metapaths (
Figure 1b,
Table 6). To enable self-attention mechanisms to autonomously learn metapath contributions, we exhaustively generate all permuted metapath combinations within specified lengths. According to
-decay theory [
26,
77], higher-order structural features can be effectively approximated through low-order neighbors (small
h-hops), as approximation errors decrease exponentially with
h. Thus, metapaths with lengths
suffice to capture sufficient high-order structural information.
5.3. RGCN Node Pre-Encoding Module
Prior to feeding node vectors into RGCN, we apply type-specific linear transformations to project feature vectors of different node types into a common latent space. For each node
of type
T, the transformation is defined as
where
denotes the feature representation of node
v at layer
k,
represents the original feature vector, and
is the learnable weight matrix for type
T nodes.
After a comprehensive evaluation, we select a single-layer RGCN as the pre-encoding module. This shallow architecture focuses on capturing direct first-order neighborhood features while preserving primitive topological information fidelity. As shown in
Figure 1c, RGCN handles heterogeneous edges through relational-aware processing. The update rule for each node
v in RGCN is formulated as
where
denotes the set of all relation types,
represents the neighborhood of node
v under relation
r, and
is the activation function.
5.4. MetaPath Full-Node Encoding
To comprehensively exploit semantic information from metapaths, we design a three-layer nested full-node encoder architecture that progressively processes information through the following stages: individual metapath encoding, intra-class metapath aggregation, and inter-class metapath aggregation.
For encoding individual metapaths, we employ linear transformation layers. The encoded representations of node-specific metapath
and common metapath
are formulated as
where metapaths of the same category share the identical weight matrix
.
After completing individual metapath encoding, we obtain encoded metapaths
for category
P. We then apply graph attention network [
78] to compute weighted sums of all metapath encodings related to target nodes. The aggregated representation for category
P is formulated as
where
denotes the learnable parameter.
As demonstrated by Yang et al. [
25], neighbor attention mechanisms within homogeneous relations are non-essential, where simple mean aggregation achieves comparable effectiveness to attention-based approaches. We therefore formulate an alternative intra-class metapath aggregation as
Finally, we employ an inter-class metapath aggregation layer to integrate semantic information revealed by all metapaths. Unlike MAGNN’s batch-wise additive attention mechanism, we implement sample-wise multiplicative self-attention to aggregate metapath vectors. This design ensures batch independence while enabling adaptive attention weights per sample, thereby enhancing model robustness. Given encoded metapaths
, the inter-class aggregation is computed through
where
denotes the activated output from the linear transformation of
,
are linear projections of
X, and
transforms attention weights
for all metapaths.
5.5. Link Prediction Decoder
For a specific link , we generate node-related metapaths and , along with their common metapath . Through HMPNN’s pre-encoding and full-node metapath encoding, we obtain encoded representations , , and for both nodes and their link. The node encodings and are secondary encodings derived from pre-encoded features, which capture semantic interaction information between homogeneous nodes since the metapaths encoding these nodes connect homogeneous nodes at both ends. The link encoding originates from common metapaths between nodes u and v, meaning these nodes are interconnected through intermediate nodes that form multiple communication pathways, constituting the most critical structural features of the connection.
We present the simplified topological structure schematic diagram of the previously constructed heterogeneous biological entity graph in
Figure 7. This figure is drawn by selecting a specific disease as the source node and a specific miRNA as the target node and trimming most connections in the real dataset to retain only a few. Two potentially related nodes are connected through a pathway composed of multiple intermediate nodes. The source node connects to several hub nodes (e.g., mRNA-type nodes), while most satellite nodes linked to the target node associate with these hubs. All intermediate nodes provide structural insights for the source-target connection. Therefore, intermediate nodes along common metapaths serve as crucial structural features whose embeddings couple the link’s semantic features. This integration of structural and semantic features provides direct evidence for determining node connectivity. For node pairs with missing connections, we complete their metapaths using the nodes themselves, which aligns with biological intuition.
Based on the theoretical framework, we design a specialized decoder for link prediction:
where
f,
, and
denote multilayer perceptrons (MLPs) containing linear layers, layer normalization, and activation functions. The operator ⊙ represents the Hadamard product, and
is a learnable parameter. This decoder enhances link prediction performance by combining primary insights from node-level embeddings (
) with higher-order structural and semantic insights (
).