1. Introduction
With the rapid development of music streaming services, the global music industry is undergoing unprecedented digital transformation. According to the IFPI Global Music Report 2025, global paid subscription streaming accounts reached 752 million in 2024, with music catalog sizes reaching tens of millions of tracks [1]. In this massive digital ecosystem, efficiently identifying key music relationship patterns is vital for improving user experience, supporting personalized recommendations, and understanding music evolution trends.
Traditional music analysis methods rely heavily on content-based feature extraction, collaborative filtering, or simple metadata matching [2]. While these approaches have proven useful over the past decades, they face increasing challenges given the complexity and heterogeneity of modern music data. Music data contains rich semantic information spanning multiple modalities: audio features, lyrical content, artist relationships, genre classifications, emotion labels, and cross-domain musical connections. Simple frequency-based metrics fail to capture the intricate structural relationships and semantic dependencies that characterize contemporary music networks [3].
The Music Knowledge Graph (MKG) represents a paradigm shift in organizing and analyzing musical knowledge. By integrating multi-type entities such as songs, artists, albums, genres, and emotion labels along with their semantic relationships, MKGs provide powerful infrastructure for music relationship identification and trend forecasting [4]. However, identifying key music relationship nodes within an MKG remains a significant challenge due to several factors: the massive scale of music data (often millions of entities), the heterogeneous nature of entity types with fundamentally different semantic meanings, the dynamic evolution of musical domains, and the need for interpretable results that can guide recommendation decisions.
This task can be formalized as a node importance estimation problem—a core challenge in graph mining with broad applications in social network analysis, recommendation systems, resource allocation, and knowledge discovery. Traditional node importance estimation methods emphasize centrality measures such as degree centrality, betweenness centrality [5], and PageRank [6]. However, these methods exhibit obvious limitations when dealing with heterogeneous graphs: they treat all node types equally, ignore semantic differences between entity types, and often fail to capture the structural importance of nodes in maintaining network connectivity and information flow.
In recent years, graph robustness-based methods have emerged as a promising approach to assess a node's impact on network stability [7,8]. Graph robustness metrics measure how the removal of a node affects the overall connectivity and efficiency of the network, providing a structure-aware evaluation of node importance. However, the exact computation of graph robustness has complexity up to $O(n^4)$, which is computationally infeasible for large-scale MKGs [9]. Even approximate methods typically require $O(n^2)$ or higher complexity, limiting their practical applicability for real-time analysis and dynamic graphs.
The emergence of Graph Neural Networks (GNNs) has revolutionized graph-structured data analysis, achieving remarkable success in tasks such as node classification, link prediction, and graph generation [10,11,12]. GNNs learn node representations by iteratively aggregating information from neighboring nodes, enabling them to capture both local structural patterns and global graph properties. Recent advances in neural network acceleration and optimization [13] have further enabled efficient deployment of deep learning models on resource-constrained platforms, demonstrating the broader trend toward practical, scalable machine learning systems. Several recent works have proposed GNN-based methods for node importance estimation [14,15]. However, these methods often overlook the heterogeneous nature of MKGs, in which different entity types (songs, relationships, artists, albums) have fundamentally different semantic meanings and structural roles.
In MKGs, music relationship nodes face severe class imbalance—they are vastly outnumbered by song nodes, often by a ratio of 1:20 or more. This extreme imbalance creates several challenges for existing GNN methods. First, standard message-passing mechanisms treat all nodes equally during information aggregation, leading to inefficient computation, with most resources spent on less informative song nodes. Second, uniform propagation depth across all nodes can cause over-smoothing [16], where the representations of different nodes become increasingly similar as the network depth increases, reducing the discriminative power needed to identify critical relationship nodes. Third, fixed activation functions such as ReLU and GELU fail to capture the structural and semantic heterogeneity across different node types and local graph patterns, limiting the model's ability to adapt to the diverse characteristics of MKG entities.
To address these fundamental challenges, this paper proposes MUSIGAIN, a GATv2-based adaptive framework that synergistically combines graph robustness metrics with advanced graph neural network mechanisms. MUSIGAIN introduces three key innovations that work together to enable efficient and accurate node importance estimation in heterogeneous MKGs:
Dynamic Propagation Control. We introduce a layer-wise dynamic skipping mechanism that monitors the third-order embedding difference of each node to detect convergence. Nodes whose representations have stabilized skip further updates, adaptively controlling propagation depth on a per-node basis. This mechanism simultaneously addresses over-smoothing in deep architectures and reduces computational overhead by 30–40%, enabling efficient processing of large-scale graphs.
Semantic-Adaptive Nonlinearity. We incorporate the DiGRAF (Diffeomorphic Graph-Adaptive Activation Function) [17] that learns node-specific activation parameters based on node type, current embedding, and local structural features. Unlike fixed activation functions that apply uniform transformations, DiGRAF enables each node to have a customized nonlinear mapping that reflects its semantic role and structural context, better capturing the heterogeneity inherent in MKGs.
Robustness-Supervised Ranking. Rather than predicting absolute robustness values (which requires expensive eigenvalue computations), we employ ranking-based optimization that focuses on learning the correct relative ordering of node importance. By combining pairwise ranking loss with listwise softmax loss, the model learns to identify structurally critical nodes while avoiding the computational bottleneck of exact robustness calculation.
These innovations are integrated into a unified architecture featuring a Location-Sensing Transformer (LST) for type-specific feature extraction, GATv2-based message passing with attention mechanisms, and a fusion layer that combines spatial and temporal information patterns.
The main contributions of this work are as follows:
We propose MUSIGAIN, a GATv2-based adaptive framework that combines layer-wise dynamic skipping and DiGRAF adaptive activation to enhance both representation capability and computational efficiency in heterogeneous music knowledge graphs. The dynamic skipping mechanism monitors embedding stability to prevent over-smoothing while reducing computation by 30–40%. Additionally, DiGRAF provides node-specific transformations to capture semantic heterogeneity across entity types.
We introduce a ranking-based optimization approach, supervised by graph-robustness metrics, that focuses on relative importance ordering rather than absolute value prediction. This approach efficiently identifies critical music relationship nodes and contributes to discovering music trends while maintaining computational efficiency.
We conduct extensive experiments on four real-world music knowledge graphs across different musical genres (pop, rock, jazz, and classical), showing that MUSIGAIN consistently outperforms strong baselines. The framework achieves a maximum of 96.78% accuracy in Top-5% node identification, with improvements reaching 3.91 percentage points over GATv2, and exhibits favorable scalability for industrial deployment.
The remainder of this paper is organized as follows. Section 2 reviews related work on music knowledge graphs, node importance estimation, and graph neural networks. Section 3 provides the problem definition, including the formal specification of music knowledge graphs and graph robustness metrics. Section 4 details the MUSIGAIN framework architecture. Section 5 presents the experimental setup, and Section 6 analyzes the results. Section 7 discusses key findings and future directions. Finally, Section 8 concludes the paper.
2. Related Work
2.1. Music Knowledge Graphs and Multi-Relationship Mining
Music knowledge graphs have emerged as powerful tools for organizing and analyzing musical information [3]. Oramas et al. [3] pioneered the use of knowledge graph technology to generate music recommendation systems, demonstrating the potential of structured knowledge representation for music mining. Unlike traditional music analysis methods that focus primarily on collaborative filtering networks or simple audio feature matching, MKGs integrate multiple entity types and relationship categories to capture the rich semantic structure of musical domains.
Recent advances in music knowledge graph construction have leveraged sophisticated natural language processing and audio analysis techniques. Silva et al. [18] proposed heterogeneous graph neural networks for music emotion recognition, modeling the intrinsically multimodal nature of musical data by integrating information from audio and lyrics. Ding et al. [19] introduced knowledge-embedded music representations for genre classification, demonstrating how external knowledge can enrich audio-based features. Recent work by Li et al. [20] further advances music genre classification using channel-aware convolutional neural networks with adaptive attention mechanisms, achieving state-of-the-art performance on benchmark datasets. Zhang et al. [21] developed hybrid recommendation algorithms that combine music genes with knowledge graph embeddings, showing improved performance over single-modal approaches. Wang et al. [22] extend this line of work by proposing hybrid music recommendation systems that leverage graph neural networks to capture complex user-item-context interactions. The Million Song Dataset [23] has become a cornerstone resource for large-scale music analysis, providing standardized benchmarks for evaluating recommendation algorithms. Recent advances in music emotion recognition [24] and music analysis with GNNs [25] demonstrate the growing importance of deep learning approaches for understanding musical content and structure.
The construction of MKGs typically involves several key steps: entity extraction from music databases, relationship identification between entities, knowledge fusion across multiple data sources, and quality validation. Modern approaches leverage natural language processing techniques to extract structured information from unstructured music texts such as lyrics and reviews. However, once constructed, the challenge shifts to identifying which music relationships within the MKG are most influential or represent emerging music trends.
Music relationship identification in MKGs differs fundamentally from traditional document clustering or topic modeling approaches. In MKGs, music relationships are explicit entities connected through multiple relationship types to songs, artists, albums, and genres [26]. The importance of a relationship is determined not just by the number of associated songs, but by its structural position in the knowledge graph, its connections to influential artists and albums, and its role in bridging different musical domains. This structural perspective on relationship importance motivates the use of graph-based methods rather than purely content-based approaches.
2.2. Node Importance Estimation
Node importance estimation is a fundamental problem in graph mining with broad applications across multiple domains [27]. Traditional approaches are based on various centrality measures that quantify a node's structural significance. Degree centrality counts the number of connections a node has, betweenness centrality measures how often a node appears on shortest paths between other nodes, and PageRank [6] evaluates importance through a random walk model where importance is propagated through graph edges. Additional classical metrics include eigenvector centrality [28] and Katz centrality [29], which consider the importance of neighboring nodes in determining a node's significance.
While these classical centrality measures have proven useful in many applications, they suffer from significant limitations when applied to heterogeneous graphs like MKGs. First, they treat all edges equally, ignoring the semantic differences between relationship types. For instance, a “performer” relationship between a song and an artist has a fundamentally different meaning than a “similar to” relationship between two songs, yet degree centrality would count them equally. Second, these measures assign uniform importance to all node types, failing to account for the fact that music relationship nodes and song nodes play fundamentally different roles in the knowledge graph.
Graph robustness-based methods offer an alternative perspective on node importance. Rather than measuring structural position, these approaches evaluate how a node’s removal affects the overall connectivity and efficiency of the network. Common graph robustness metrics include connectivity-based indicators, such as the size of the largest connected component; distance-based indicators, such as network efficiency and average path length; and spectrum-based indicators, such as algebraic connectivity and Laplacian energy.
Among robustness metrics, spectral methods have received particular attention due to their solid theoretical foundations [30]. The weighted spectrum, defined as the sum of all eigenvalues of the graph Laplacian matrix [31], reflects the overall network connectivity and information propagation capability. When a node is removed from the graph, the change in weighted spectrum provides a quantitative measure of that node's structural importance. However, exact computation of spectral robustness metrics requires eigenvalue decomposition, which has $O(n^3)$ complexity for a single calculation and must be repeated for each node, resulting in overall complexity up to $O(n^4)$ for complete node importance ranking.
Recent advances in node importance estimation have explored machine learning approaches. Lü et al. [32] provide a comprehensive review of vital node identification methods, categorizing approaches by their theoretical foundations and computational complexity. Wang et al. [33] present a comprehensive survey of critical node identification techniques in complex networks, systematically categorizing methods and highlighting challenges in dynamic and higher-order networks. Fan et al. [34] propose using deep reinforcement learning to identify key players in complex networks, demonstrating that learned approaches can outperform traditional centrality measures in certain contexts. Recent work on GNN-based ranking [35] shows that learning to rank actions in graph-structured problems can achieve better generalization than value-based approaches.
Despite the effectiveness of these classical approaches in quantifying node significance, they primarily rely on static structural heuristics and often fail to scale or adapt to heterogeneous knowledge graphs with diverse semantic relations. This limitation has motivated the development of graph neural network (GNN) based methods, which leverage representation learning to capture higher-order dependencies and complex interactions among nodes.
2.3. Graph Neural Networks for Node Importance Estimation
The advent of Graph Neural Networks has opened new possibilities for scalable node importance estimation. GNNs learn distributed representations of nodes by iteratively aggregating and transforming features from neighboring nodes [36]. Early GNN architectures include Graph Convolutional Networks (GCNs), which extend the convolution operation from regular grids to irregular graph structures, and GraphSAGE [11], which introduces an inductive learning framework that can generalize to unseen nodes through neighborhood sampling.
Graph attention mechanisms have further enhanced GNN expressiveness by allowing nodes to assign different weights to different neighbors [37]. The original Graph Attention Networks (GAT) enable adaptive neighbor weighting through learned attention coefficients. However, Brody et al. [12] identified limitations in the original GAT attention mechanism and proposed GATv2, which achieves truly dynamic attention by adjusting the order of operations in attention score computation. Recent applications of GATv2 [38] demonstrate its effectiveness in solving complex combinatorial problems through graph-based deep learning, validating the importance of dynamic attention mechanisms.
Several recent works have proposed GNN-based methods for estimating node importance in knowledge graphs. GENI [14] integrates structural features with external signals such as text content to estimate node importance in knowledge graphs. MULTIIMPORT [15] extends this approach by inferring node importance from multiple input signals, including graph structure, node features, and temporal dynamics. These methods achieve good performance on knowledge graphs with relatively balanced entity type distributions. However, they do not fully exploit robustness metrics and typically treat all nodes equally during message passing, which can lead to inefficiency when applied to highly imbalanced graphs like MKGs.
A fundamental challenge for GNN-based importance estimation in MKGs is the over-smoothing problem [16]. As GNNs stack more layers to capture long-range dependencies, node representations tend to become increasingly similar, reducing the discriminative power needed for fine-grained importance ranking. Rusch et al. [39] provide a comprehensive survey on over-smoothing in GNNs, axiomatically defining it as the exponential convergence of node features and extending the analysis to continuous-time GNNs. Li et al. [40] provide theoretical insights into this phenomenon, showing that repeated averaging operations in deep GNNs lead to convergence to a stationary distribution. Various solutions have been proposed, including DropEdge [41], which randomly drops edges during training, PairNorm [42], which normalizes node features to maintain diversity, and recent novel dropout approaches [43], which specifically target over-smoothing mitigation. Zhang et al. [44] propose channel-attentive graph neural networks that learn adaptive channel-wise message passing to alleviate over-smoothing and achieve state-of-the-art performance on heterophilous graphs.
2.4. Adaptive Mechanisms in Graph Neural Networks
Recent advances in GNN architectures have introduced various adaptive mechanisms to address the limitations of standard message passing. Residual connections and skip connections allow information to bypass intermediate layers, helping to preserve input features and mitigate over-smoothing. Adaptive depth mechanisms dynamically determine the number of layers to apply to each node based on local graph structure or task requirements. For instance, nodes in dense communities might require fewer layers to capture their local structure, while nodes bridging different communities might benefit from deeper propagation.
Heterogeneous graph neural networks explicitly model different node and edge types through type-specific parameters [45]. Schlichtkrull et al. [46] propose relational graph convolutional networks (R-GCN) that use relation-specific weight matrices to handle multi-relational data. Recent advances in heterogeneous GNNs [47] introduce adaptive relation attention mechanisms for improved node classification, demonstrating the importance of learning type-specific attention weights. These methods use relation-specific transformations and attention mechanisms to capture semantic differences between entity types.
Recent work in music-specific GNN applications has shown promising results. Mao et al. [48] propose multistage graph embeddings for music recommendation, demonstrating improved cold-start performance. Korzeniowski et al. [49] use GNNs for artist similarity computation, showing that learned representations outperform traditional collaborative filtering approaches. Ferraro et al. [50] combine audio and text features in a hybrid recommender system for automatic playlist continuation. Recent advances [51] leverage artificial intelligence and knowledge graphs for online music learning platforms, demonstrating the broader applicability of graph-based approaches in music education and analysis. Silva et al. [52] propose parallel convolutional neural networks for music genre classification, achieving improved accuracy through multi-scale feature extraction.
The DiGRAF (Diffeomorphic Graph-Adaptive Activation Function) [17] represents a significant advance in adaptive nonlinearity for GNNs. Unlike fixed activation functions, DiGRAF parameterizes the activation function itself and learns these parameters based on node features, node types, and local structural information. This approach enables the network to apply different nonlinear transformations to different nodes, potentially capturing the heterogeneous nature of complex graphs more effectively. Recent work on adaptive activation functions [53] further demonstrates the benefits of morphing activation functions that can dynamically adjust their shape based on input characteristics, providing additional flexibility for neural network architectures.
Beyond traditional supervised learning, recent multimodal graph transformers [54] have demonstrated the ability to integrate diverse data modalities (text, images, audio) within graph structures, showing promise for complex knowledge graphs. Self-supervised graph learning [55,56] has emerged as a powerful paradigm that learns representations without explicit labels through contrastive learning and graph augmentation strategies. While these approaches offer exciting possibilities for reducing annotation requirements and improving generalization, they have primarily been applied to homogeneous graphs or simple heterogeneous structures. MUSIGAIN differs in that it focuses on node importance estimation in highly heterogeneous music knowledge graphs with explicit robustness-based supervision, addressing a complementary but distinct challenge. Future work could explore integrating self-supervised pretraining with our ranking-based framework to further enhance performance and reduce dependency on labeled robustness scores.
3. Problem Definition
3.1. Music Knowledge Graph
A music knowledge graph can be formally defined as a heterogeneous graph $G = (V, E, \phi, \psi)$, where $V$ is the set of nodes representing various entities in the music domain, including songs, artists, albums, genres, emotion labels, and music relationships; $E \subseteq V \times V$ is the set of edges representing relationships between entities; $\phi: V \rightarrow T$ is the node type mapping function, where $T$ is the set of node types; and $\psi: E \rightarrow R$ is the edge type mapping function, where $R$ is the set of edge types.
In MKGs, $T$ typically includes Song, Artist, Album, Genre, Emotion, and MusicRelation, while common edge types include PerformedBy, BelongsToAlbum, HasGenre, HasEmotion, SimilarTo, and Collaborates. For example, a song "Bohemian Rhapsody" might be connected to the artist "Queen" through a PerformedBy edge, to the album "A Night at the Opera" through a BelongsToAlbum edge, to the genre "Progressive Rock" through a HasGenre edge, and to other songs through SimilarTo edges based on acoustic or semantic similarity.
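To make the schema concrete, a fragment of this example neighborhood could be encoded as follows (a sketch assuming PyTorch Geometric's HeteroData container; node counts, feature dimensions, and indices are illustrative):

```python
import torch
from torch_geometric.data import HeteroData

# Illustrative MKG fragment around "Bohemian Rhapsody" (indices hypothetical).
data = HeteroData()
data['song'].x = torch.randn(3, 128)    # features for 3 songs
data['artist'].x = torch.randn(1, 64)   # "Queen"
data['album'].x = torch.randn(1, 64)    # "A Night at the Opera"
data['genre'].x = torch.randn(1, 32)    # "Progressive Rock"

# Edge types follow the (source, relation, target) schema from the text.
data['song', 'PerformedBy', 'artist'].edge_index = torch.tensor([[0], [0]])
data['song', 'BelongsToAlbum', 'album'].edge_index = torch.tensor([[0], [0]])
data['song', 'HasGenre', 'genre'].edge_index = torch.tensor([[0], [0]])
data['song', 'SimilarTo', 'song'].edge_index = torch.tensor([[0, 0], [1, 2]])

print(data.node_types, data.edge_types)
```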
The heterogeneous nature of MKGs distinguishes them from traditional homogeneous graphs [27]. Different node types have distinct semantic meanings and play different roles in representing musical knowledge. For instance, song nodes represent specific musical works with concrete audio features and lyrics, relationship nodes represent general musical concepts or patterns that connect multiple songs, and artist nodes represent individuals or groups who create music. This heterogeneity must be explicitly modeled [26] to accurately capture node importance.
3.2. Node Importance Estimation Task
Given a music knowledge graph $G$ and a target node type set $T' \subseteq T$ (typically $T' = \{\text{MusicRelation}\}$ for music relationship mining), the goal of node importance estimation is to learn a scoring function $f: V \rightarrow \mathbb{R}$ such that for any two nodes $u, v \in V$ where $\phi(u), \phi(v) \in T'$, if $u$ is structurally more important than $v$, then $f(u) > f(v)$.
The key challenge is to define and compute structural importance in a way that: (1) captures the multifaceted nature of importance in heterogeneous graphs, where a node's importance depends not just on its connections but also on its type and the types of its neighbors; (2) scales efficiently to large graphs with hundreds of thousands of nodes, requiring computational complexity that is practical for real-world applications; (3) provides interpretable results that align with domain expertise, so that music professionals can understand and trust the rankings; and (4) generalizes to unseen nodes in an inductive setting [11], allowing the model to evaluate new songs and relationships as they are added to the knowledge graph.
3.3. Graph Robustness Metrics
Graph robustness measures the network’s ability to maintain functionality when facing perturbations or targeted attacks. Intuitively, a network is considered robust if it can maintain connectivity and efficient information flow despite the removal of nodes or edges. This property is of particular importance in music knowledge graphs (MKGs), where the removal of critical relationship nodes may significantly affect the connectivity between songs, artists, and genres.
We adopt the weighted spectrum as our primary robustness metric, which is defined as the sum of all eigenvalues of the normalized graph Laplacian matrix:

$$S(G) = \sum_{i=1}^{n} \lambda_i,$$

where $\lambda_i$ is the $i$-th eigenvalue of the normalized Laplacian matrix $\mathcal{L} = I - D^{-1/2} A D^{-1/2}$, with $A$ being the adjacency matrix and $D$ being the degree matrix. This metric reflects the overall connectivity and diffusion capacity of the graph.
When a node $v$ is removed from the graph, we can measure its importance by the change in weighted spectrum:

$$\Delta S(v) = S(G) - S(G \setminus \{v\}),$$

where $G \setminus \{v\}$ denotes the graph with node $v$ and all its incident edges removed. Nodes with larger $\Delta S(v)$ values are more structurally critical, as their removal causes greater degradation in network connectivity and information flow capability.
Beyond the weighted spectrum, robustness can also be analyzed through connectivity and efficiency metrics. One common indicator is network efficiency, which evaluates the average inverse shortest-path distance across all node pairs:

$$E(G) = \frac{1}{n(n-1)} \sum_{i \neq j} \frac{1}{d_{ij}},$$

where $d_{ij}$ is the shortest path length between nodes $i$ and $j$. A robust graph maintains high efficiency even under node removal, whereas a fragile graph exhibits sharp efficiency drops.
Another perspective is provided by the algebraic connectivity, defined as the second smallest eigenvalue of the Laplacian matrix:

$$\lambda_2(G) = \min_{x \perp \mathbf{1},\, x \neq 0} \frac{x^\top L x}{x^\top x}.$$

This value reflects how well connected the graph is: higher $\lambda_2$ indicates stronger global connectivity, while lower values reveal vulnerability to fragmentation. In practice, combining $\Delta S(v)$ with $E(G)$ and $\lambda_2$ allows for a comprehensive evaluation of node importance, capturing both spectral properties and topological resilience.
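For small graphs, these definitions translate directly into code. The following sketch (assuming NetworkX and NumPy; the cubic cost of each eigendecomposition is exactly why Section 4 resorts to learned approximations) computes all three metrics:

```python
import networkx as nx
import numpy as np

def weighted_spectrum(G):
    # S(G): sum of eigenvalues of the normalized Laplacian.
    L = nx.normalized_laplacian_matrix(G).toarray()
    return float(np.sum(np.linalg.eigvalsh(L)))

def delta_spectrum(G, v):
    # Delta S(v): spectral drop caused by removing v and its incident edges.
    H = G.copy()
    H.remove_node(v)
    return weighted_spectrum(G) - weighted_spectrum(H)

def network_efficiency(G):
    # E(G): average inverse shortest-path distance over all node pairs.
    return nx.global_efficiency(G)

def algebraic_connectivity(G):
    # lambda_2: second-smallest eigenvalue of the (unnormalized) Laplacian.
    lam = np.linalg.eigvalsh(nx.laplacian_matrix(G).toarray())
    return float(lam[1])

G = nx.karate_club_graph()
scores = {v: delta_spectrum(G, v) for v in G.nodes}
print(max(scores, key=scores.get))  # most structurally critical node
```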
4. MUSIGAIN Framework
4.1. Overall Architecture
The MUSIGAIN framework is designed to identify key nodes in heterogeneous music knowledge graphs by leveraging a multi-stage processing pipeline. At its core, the architecture integrates multiple specialized modules that handle type-specific features, attention-based convolutions, and memory aggregation to efficiently capture multi-relationships across nodes.
As shown in Figure 1, the framework processes heterogeneous music graphs through a carefully orchestrated multi-stage pipeline with explicit information flow and module interactions.
Information Flow and Module Interactions. The processing pipeline operates as follows:
Stage 1: Type-Specific Initialization (Section 4.3 and Section 4.4). Raw node features $x_v$ are first processed by the LST Network, which transforms heterogeneous inputs into unified embeddings $h_v^{(0)}$. The LST encoder uses deformable attention to create a 2D feature grid, capturing spatial patterns within feature vectors. The decoder then refines these patterns through cross-attention, producing initial representations that preserve type-specific semantics while enabling downstream processing in a shared embedding space.
Stage 2: Iterative Message Passing with Adaptive Control (Section 4.5). For layers $l = 1, \dots, L$, each node $v$ undergoes: (1) GATv2 attention computation: $e_{vu}^{(l)} = a^\top \mathrm{LeakyReLU}\big(W^{(l)} [h_v^{(l-1)} \,\|\, h_u^{(l-1)}]\big)$ for all neighbors $u \in N(v)$, producing attention weights $\alpha_{vu}^{(l)}$ via softmax normalization; (2) Neighborhood aggregation: $\tilde{h}_v^{(l)} = \sum_{u \in N(v)} \alpha_{vu}^{(l)} W^{(l)} h_u^{(l-1)}$; (3) DiGRAF adaptive activation: $h_v^{(l)} = \sigma_{\theta_v^{(l)}}\big(\tilde{h}_v^{(l)}\big)$, where parameters $\theta_v^{(l)}$ are computed from current embeddings, node types, and local structure; (4) Dynamic skip decision: if $\Delta^{(3)}_v(l) < \tau^{(l)}$ and $l \geq l_0$, node $v$ is marked stable and skips updates in subsequent layers, with $h_v^{(l')} = h_v^{(l)}$ for $l' > l$.
The dynamic skip mechanism creates a control flow where different nodes effectively experience different network depths: critical relationship nodes typically propagate through all L layers, while peripheral song nodes may stabilize and skip after 4–5 layers. This adaptive depth control is key to preventing over-smoothing while maintaining computational efficiency.
Stage 3: Memory-Augmented Aggregation (Section 4.5). In parallel with GATv2 layers, an attention-based aggregation module maintains a memory matrix $M$ that captures global graph patterns. At each layer, node embeddings are scored via $s_v^{(l)} = q^\top h_v^{(l)}$ and weighted by softmax-normalized attention $\beta_v^{(l)}$. The global representation $g^{(l)} = \sum_{v} \beta_v^{(l)} h_v^{(l)}$ is used to update $M$ through a gating mechanism, allowing the model to accumulate historical information across layers.
Stage 4: Fusion and Ranking (Section 4.6). The final node embedding $h_v^{(L)}$ from GATv2 layers and the memory-augmented representation are concatenated and passed through an MLP regression head to produce importance scores $\hat{s}_v$. During training, these scores are optimized via a composite loss: $\mathcal{L} = \mathcal{L}_{\mathrm{pair}} + \lambda_1 \mathcal{L}_{\mathrm{list}} + \lambda_2 \mathcal{L}_{\mathrm{cls}} + \lambda_3 \|\Theta\|_2^2$, where the pairwise ranking loss enforces correct ordering, the listwise loss captures global ranking structure, the classification loss maintains type discrimination, and regularization prevents overfitting.
Training and Inference Control Logic. During training, all modules are updated end-to-end via backpropagation. The dynamic skip decisions are made in the forward pass based on embedding stability, creating a dynamic computational graph where gradient flow is automatically truncated for stable nodes. During inference, the model processes new nodes inductively: LST extracts initial features, GATv2 layers aggregate from existing neighbors (with dynamic skipping applied), and the regression head produces importance scores without requiring graph-level recomputation. This inductive capability enables real-time evaluation of newly added songs and relationships.
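To make this control logic concrete, the following runnable miniature (assuming PyTorch and PyTorch Geometric, which the experiments in Section 5.4 also use) sketches Stages 2 and 4; the LST initializer, DiGRAF activation, and memory module are elided, and `MiniMUSIGAIN` is an illustrative name, not the full system:

```python
import torch
from torch_geometric.nn import GATv2Conv

class MiniMUSIGAIN(torch.nn.Module):
    """Miniature of Stages 2 and 4: GATv2 message passing with per-node
    dynamic skipping and an MLP scoring head (LST/DiGRAF/memory elided)."""
    def __init__(self, dim=64, layers=8, l0=4, eps=0.25):
        super().__init__()
        self.convs = torch.nn.ModuleList(
            [GATv2Conv(dim, dim, heads=4, concat=False) for _ in range(layers)])
        self.head = torch.nn.Sequential(
            torch.nn.Linear(dim, dim // 2), torch.nn.ReLU(),
            torch.nn.Linear(dim // 2, 1))
        self.l0, self.eps = l0, eps  # eps: illustrative value in the tuned range

    def forward(self, x, edge_index):
        h, hist = x, [x]
        stable = torch.zeros(x.size(0), dtype=torch.bool)
        for l, conv in enumerate(self.convs, start=1):
            h_new = torch.relu(conv(h, edge_index))
            h = torch.where(stable.unsqueeze(-1), h, h_new)  # stable nodes skip
            hist.append(h)
            if l >= self.l0:  # third-order difference convergence test
                d3 = (hist[-1] - 3*hist[-2] + 3*hist[-3] - hist[-4]).norm(dim=-1)
                tau = self.eps * (d3.max() - d3.min())        # adaptive threshold
                stable |= d3 < tau
        return self.head(h).squeeze(-1)                        # importance scores

model = MiniMUSIGAIN()
x, edge_index = torch.randn(100, 64), torch.randint(0, 100, (2, 400))
print(model(x, edge_index).shape)  # torch.Size([100])
```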
In the following sections, we provide detailed mathematical formulations for each module, with particular focus on how DiGRAF, LST, and dynamic skipping work together to enhance node importance estimation.
4.2. DiGRAF Adaptive Activation
Traditional GNNs use fixed activation functions such as ReLU, LeakyReLU, or GELU, which apply the same nonlinear transformation to all nodes regardless of their semantic type or structural context. While effective in homogeneous graphs, this uniform treatment fails to capture the semantic and structural heterogeneity present in music knowledge graphs (MKGs), where songs, artists, genres, and relationship nodes play fundamentally different roles. To better address this challenge, we incorporate the DiGRAF (Diffeomorphic Graph-Adaptive Activation Function) [17], which enables node-specific adaptive nonlinearities.
Figure 2 illustrates how DiGRAF is integrated into our GNN architecture. After the standard GNN layer computation produces an intermediate embedding $\tilde{h}_v^{(l)}$, DiGRAF applies a node-specific adaptive transformation parameterized by learnable differential equations.
Formally, DiGRAF defines the nonlinear mapping $\sigma_\theta$ through an ODE flow:

$$\sigma_\theta(x) = z(1), \qquad \frac{dz(t)}{dt} = f_\theta\big(z(t), t\big), \qquad z(0) = x,$$

where $f_\theta$ is a learnable vector field parameterized by $\theta$.

Physical Meaning and Design Motivation. This ODE formulation has clear geometric and physical interpretations. The integral form $\sigma_\theta(x) = x + \int_0^1 f_\theta(z(t), t)\, dt$ represents a continuous transformation that "flows" the input $x$ through a learned vector field over pseudo-time $t \in [0, 1]$. Unlike discrete activations (e.g., ReLU) that apply a single fixed transformation, the ODE flow allows the activation to smoothly deform the feature space in a node-specific manner. The pseudo-time parameter $t$ acts as a regularization mechanism: at $t = 0$, the transformation begins from the identity ($z(0) = x$), and as $t$ increases to 1, the learned vector field $f_\theta$ progressively applies nonlinear deformations. This continuous formulation guarantees diffeomorphism (invertibility and smoothness), ensuring stable training and preventing gradient vanishing or explosion issues that often arise in deep GNNs. From a dynamical systems perspective, $f_\theta$ defines how node representations evolve, with different nodes following different trajectories based on their semantic types and structural contexts. This design is motivated by neural ODEs [57], which have shown that continuous-depth models can learn more expressive transformations than discrete-layer networks while maintaining numerical stability through adaptive ODE solvers.
The node-specific parameters $\theta_v^{(l)}$ are computed adaptively at each layer:

$$\theta_v^{(l)} = g\big(h_v^{(l)},\, c_v,\, s_v\big),$$

where $h_v^{(l)}$ is the current node embedding, $c_v$ encodes the node type, and $s_v$ aggregates structural statistics of the local neighborhood (e.g., degree, clustering coefficient).
The GNN update rule with DiGRAF becomes the following:

$$h_v^{(l)} = \sigma_{\theta_v^{(l)}}\Big(\sum_{u \in N(v)} \alpha_{vu}^{(l)} W^{(l)} h_u^{(l-1)}\Big),$$

where $\alpha_{vu}^{(l)}$ are attention coefficients from GATv2 and $W^{(l)}$ is the trainable weight matrix.
To better understand DiGRAF's adaptive behavior, consider its Jacobian matrix:

$$J_{\sigma_\theta}(x) = \frac{\partial \sigma_\theta(x)}{\partial x}.$$

The diffeomorphic property ensures $\det\big(J_{\sigma_\theta}(x)\big) \neq 0$, which means each transformation is invertible and preserves structural information during propagation. In contrast, fixed activations (e.g., ReLU) correspond to piecewise-linear mappings with non-invertible regions (e.g., zero gradient for negative values).
Moreover, the adaptive activation can approximate classical nonlinearities under specific parameterizations: a vanishing vector field $f_\theta \equiv 0$ recovers the identity map, while suitably chosen saturating fields yield sigmoid- or tanh-like behavior. This shows DiGRAF generalizes standard functions, while providing the flexibility to learn node-specific nonlinear mappings tailored to the heterogeneous MKG. By integrating DiGRAF into MUSIGAIN, each node is equipped with a customized activation function that reflects its semantic role and structural context, thereby preserving heterogeneity and enhancing representation power in deep GNN architectures.
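The following minimal sketch illustrates the idea with a fixed-step Euler discretization of the flow; the published DiGRAF uses a more sophisticated diffeomorphic parameterization, so the hypernetwork `param_net`, the field architecture, and the `steps` count here are simplifying assumptions:

```python
import torch
import torch.nn as nn

class ODEActivation(nn.Module):
    """Node-adaptive activation via an Euler-discretized ODE flow:
    sigma_theta(x) = z(1), dz/dt = f_theta(z, t), z(0) = x."""
    def __init__(self, dim, ctx_dim, steps=4):
        super().__init__()
        self.steps = steps
        # Hypernetwork: node context (embedding, type, structure) -> theta.
        self.param_net = nn.Linear(ctx_dim, 2 * dim)
        # Shared vector field conditioned on (theta, z, t).
        self.field = nn.Sequential(
            nn.Linear(2 * dim + dim + 1, dim), nn.Tanh(), nn.Linear(dim, dim))

    def forward(self, x, ctx):
        theta = self.param_net(ctx)          # node-specific parameters
        z, dt = x, 1.0 / self.steps
        for k in range(self.steps):          # Euler integration over t in [0, 1]
            t = torch.full_like(z[:, :1], k * dt)
            z = z + dt * self.field(torch.cat([theta, z, t], dim=-1))
        return z                              # smooth, near-invertible mapping

act = ODEActivation(dim=64, ctx_dim=80)
h = torch.randn(10, 64)      # intermediate embeddings
ctx = torch.randn(10, 80)    # embedding + type encoding + structural stats
print(act(h, ctx).shape)     # torch.Size([10, 64])
```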
Interpretability of Learned Nonlinearities. To illustrate how DiGRAF adapts to different node types, we analyze the learned activation patterns in POP-MKG. Song nodes typically learn smooth, sigmoid-like activations that preserve continuous audio feature representations, with average Jacobian determinants near 1.2, indicating moderate nonlinear transformation. In contrast, music relationship nodes exhibit sharper, ReLU-like activations with higher curvature (average Jacobian determinants of 2.1), enabling them to capture discrete categorical distinctions between different relationship types. Genre nodes demonstrate intermediate behavior with Jacobian determinants around 1.6, reflecting their role as semantic bridges between songs and relationships. This adaptive behavior emerges naturally from the training process without explicit supervision, demonstrating DiGRAF’s ability to discover type-specific nonlinearities that align with the semantic roles of different entity types in the knowledge graph.
4.3. Type-Specific Feature Extraction
Different node types in MKGs require different initialization strategies to capture their semantic and structural characteristics. As shown in Figure 3, each node type undergoes a customized feature extraction process where relevant local features are first gathered. For instance, for song nodes, we collect textual features from their titles and metadata such as release dates, alongside structural features such as graph degrees. Similarly, for artists, albums, genres, and relations, we aggregate corresponding features like names and graph degrees.
These collected, heterogeneous local features for each node are then processed by a powerful and unified encoder, the Location-Sensing Transformer Network (LST), to generate high-quality initial embeddings. This approach enables a single, expressive model to handle diverse feature types across the entire graph.
Formally, the initialization process for each node type can be represented as passing a concatenation of its raw local features into the LST encoder:

$$h_v^{(0)} = \mathrm{LST}\big(x_v^{1} \oplus x_v^{2} \oplus \cdots \oplus x_v^{k}\big),$$

where $\oplus$ denotes the concatenation of the $k$ raw feature blocks $x_v^{1}, \dots, x_v^{k}$ collected for node $v$, and a single specialized encoder $\mathrm{LST}(\cdot)$ is applied to the type-specific raw inputs. The architecture and principles of the LST are detailed in the following section.
4.4. Location-Sensing Transformer Network
The Location-Sensing Transformer Network, illustrated in Figure 4, serves as our primary feature encoder. It is designed to extract deep, context-aware representations from the concatenated raw features of each node by uniquely interpreting the feature vector with a spatial bias. This is particularly relevant to our theme of mining music relationships, as the interplay and relative "position" of different features can signify a node's importance. The LST consists of an encoder-decoder architecture.
The encoder first processes the initial feature vector $x_v$ of a node. To retain sequential information within the feature vector, positional encodings $p$ are added to the input:

$$z_0 = x_v + p.$$

The core of the encoder is the Multi-Head Deformable Self-Attention mechanism. Unlike standard self-attention, deformable attention focuses on a small, learnable set of key sampling points, allowing the model to capture the most salient features efficiently. The output is then passed through a residual connection and layer normalization, followed by a feed-forward network (BC-FFN) to produce the encoded representation $z_e$. This process can be formulated as:

$$\tilde{z} = \mathrm{LN}\big(z_0 + \mathrm{DeformSelfAttn}(z_0)\big), \qquad z_e = \mathrm{LN}\big(\tilde{z} + \mathrm{BCFFN}(\tilde{z})\big).$$
The key innovation of LST lies in its location-sensing core. The encoder's output sequence $z_e$ is reshaped into a 2D Feature Grid, $G$, of size $H \times W$. This step endows the network with its "location-sensing" capability by creating an explicit spatial map of features, enabling the capture of local patterns and interactions.
The decoder then refines the representation using Multi-Head Deformable Cross-Attention. It uses a set of learnable query vectors, $Q$, to attend to the flattened feature grid, which serves as the key and value. This allows the decoder to selectively distill the most critical spatial feature patterns from the grid. This cross-attention step, also followed by a residual connection, layer normalization, and a BC-FFN, produces the final polished node embedding $h_v^{(0)}$. The decoder operations are as follows:

$$\tilde{q} = \mathrm{LN}\big(Q + \mathrm{DeformCrossAttn}(Q, \mathrm{flatten}(G))\big), \qquad h_v^{(0)} = \mathrm{LN}\big(\tilde{q} + \mathrm{BCFFN}(\tilde{q})\big).$$

This final output serves as the initial node representation $h_v^{(0)}$ for the subsequent layers of our GNN.
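A simplified, runnable sketch of this encoder-decoder path is given below; standard multi-head attention stands in for the deformable variants, a plain two-layer FFN stands in for BC-FFN, and the grid size, query count, and pooling are assumptions:

```python
import torch
import torch.nn as nn

class SimpleLST(nn.Module):
    """Sketch of the LST encoder-decoder with stand-in attention modules."""
    def __init__(self, d=64, grid=8, n_queries=4, heads=4):
        super().__init__()
        self.pos = nn.Parameter(torch.randn(grid * grid, d))    # positional enc.
        self.enc_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.queries = nn.Parameter(torch.randn(n_queries, d))  # learnable Q
        self.dec_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(d), nn.LayerNorm(d)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(),
                                 nn.Linear(4 * d, d))
        self.grid = grid

    def forward(self, x):                 # x: (batch, grid*grid, d) raw features
        z = x + self.pos                  # add positional encodings
        z = self.norm1(z + self.enc_attn(z, z, z)[0])      # encoder self-attn
        z = z + self.ffn(z)                                 # encoded sequence
        g = z.view(z.size(0), self.grid, self.grid, -1)     # 2D feature grid
        kv = g.flatten(1, 2)                                # flatten grid as K, V
        q = self.queries.expand(x.size(0), -1, -1)
        out = self.norm2(q + self.dec_attn(q, kv, kv)[0])   # decoder cross-attn
        out = out + self.ffn(out)
        return out.mean(dim=1)            # pooled initial node embedding h^(0)

lst = SimpleLST()
feats = torch.randn(32, 64, 64)           # 32 nodes, 64 feature tokens of dim 64
print(lst(feats).shape)                   # torch.Size([32, 64])
```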
4.5. GATv2-Based Message Passing with Dynamic Skipping
After initialization, we apply multiple layers of graph attention to aggregate neighborhood information and refine node representations. We adopt GATv2 [12] as the base attention mechanism due to its improved expressiveness compared to the original GAT, since it relaxes the weight-sharing constraint and allows more flexible feature interactions.
For a node $v$ at layer $l$, the GATv2 update rule is defined as follows:

$$h_v^{(l)} = \sigma\Big(\sum_{u \in N(v)} \alpha_{vu}^{(l)} W^{(l)} h_u^{(l-1)}\Big),$$

where $h_v^{(l)}$ is the embedding of node $v$ at layer $l$, $W^{(l)}$ is the trainable transformation matrix, $\alpha_{vu}^{(l)}$ is the attention coefficient between node $v$ and its neighbor $u$, $N(v)$ denotes the set of neighbors of $v$, and $\sigma(\cdot)$ is a nonlinear activation function (e.g., DiGRAF or ReLU).

The attention scores are computed as follows:

$$e_{vu}^{(l)} = a^\top \mathrm{LeakyReLU}\big(W^{(l)} [h_v^{(l-1)} \,\|\, h_u^{(l-1)}]\big),$$

where $a$ is a learnable attention vector, $[h_v^{(l-1)} \,\|\, h_u^{(l-1)}]$ denotes the concatenation of embeddings of $v$ and $u$, and $\mathrm{LeakyReLU}(\cdot)$ introduces nonlinearity. The attention coefficients are then normalized by a softmax function across all neighbors:

$$\alpha_{vu}^{(l)} = \frac{\exp\big(e_{vu}^{(l)}\big)}{\sum_{u' \in N(v)} \exp\big(e_{vu'}^{(l)}\big)},$$

which ensures that $\sum_{u \in N(v)} \alpha_{vu}^{(l)} = 1$ for each node $v$.
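The difference between GAT's static attention and GATv2's dynamic attention comes down to operation order, which the following single-head sketch makes explicit (shapes and parameters are illustrative):

```python
import torch
import torch.nn.functional as F

d = 8
h_v, h_u = torch.randn(5, d), torch.randn(5, d)   # 5 candidate (v, u) pairs

# GATv2: W transforms the concatenated pair, LeakyReLU sits between W and a,
# so the ranking of neighbors can depend on the query node (dynamic attention).
W2, a2 = torch.randn(d, 2 * d), torch.randn(d)
def gatv2_scores(h_v, h_u):
    z = torch.cat([h_v, h_u], dim=-1) @ W2.T       # (5, d)
    return F.leaky_relu(z, 0.2) @ a2

# Original GAT: transform first, score, then LeakyReLU last, which collapses
# to a static neighbor ranking independent of the query node.
W1, a1 = torch.randn(d, d), torch.randn(2 * d)
def gat_scores(h_v, h_u):
    z = torch.cat([h_v @ W1.T, h_u @ W1.T], dim=-1)
    return F.leaky_relu(z @ a1, 0.2)

alpha = torch.softmax(gatv2_scores(h_v, h_u), dim=0)  # normalized weights
print(alpha.sum())  # tensor(1.) after softmax over the neighborhood
```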
To address the over-smoothing problem [16] in deep GNNs and improve computational efficiency, we introduce a layer-wise dynamic skipping mechanism. The idea is to monitor the change of embeddings across layers and skip redundant updates once the representation becomes stable. We quantify stability by defining the third-order embedding difference for node $v$ at layer $l$ as follows:

$$\Delta^{(3)}_v(l) = \big\| h_v^{(l)} - 3h_v^{(l-1)} + 3h_v^{(l-2)} - h_v^{(l-3)} \big\|_2,$$

where $\|\cdot\|_2$ denotes the $\ell_2$ norm. This measure captures higher-order smoothness by examining the rate of change in the embedding updates. Specifically, if $\Delta^{(3)}_v(l)$ is small, the embedding sequence $\{h_v^{(l)}\}$ has converged locally, indicating that further propagation would provide diminishing returns.
Theoretical Justification for Third-Order Difference. The choice of the third-order difference is motivated by numerical analysis theory for detecting convergence in iterative processes. First-order differences only measure immediate changes and cannot distinguish between steady convergence and oscillation. Second-order differences detect acceleration but may trigger premature stopping during temporary slowdowns in otherwise productive updates. Third-order differences capture the rate of change of acceleration (jerk), providing a more robust indicator of true convergence by detecting when the update trajectory has stabilized in both magnitude and direction. Empirically, we compared convergence detection using first-, second-, and third-order differences on our validation set: first-order led to 15% premature stopping (nodes stopped before reaching optimal representations), second-order showed 8% premature stopping, while third-order achieved only 2% premature stopping with 97% of the nodes converging to within 0.01 of their final representations. For non-stationary or oscillating cases, the adaptive threshold (computed from the range of $\Delta^{(3)}$ values across all nodes) automatically adjusts to accommodate global dynamics, preventing premature cutoff during systematic updates while still detecting local convergence. Nodes in local plateaus that require further propagation for long-range dependencies will exhibit non-zero third-order differences as they eventually escape the plateau, ensuring continued updates.
Starting from layer $l_0$, we compute an adaptive threshold:

$$\tau^{(l)} = \epsilon \cdot \Big(\max_{v \in V} \Delta^{(3)}_v(l) - \min_{v \in V} \Delta^{(3)}_v(l)\Big),$$

where $V$ is the set of all nodes, and $\epsilon$ is a scaling hyperparameter controlling sensitivity. A node $v$ is considered stable if

$$\Delta^{(3)}_v(l) < \tau^{(l)}.$$

Stable nodes skip further updates at layer $l$, effectively reducing unnecessary computation while preventing over-smoothing of their embeddings. This dynamic skipping mechanism allows MUSIGAIN to adaptively determine the required depth of message passing per node, improving both efficiency and representation quality.
In addition, to emphasize salient words/neighbors and suppress noise, we append a learnable attention-allocation layer on top of GATv2 outputs, as shown in Figure 5. Let $H = \{h_1, \dots, h_t\}$ denote the token/node features from the encoder, where each $h_i \in \mathbb{R}^d$. A scorer $g(\cdot)$ produces a relevance score for every element,

$$u_i = g(h_i),$$

which is normalized via a softmax to obtain attention weights

$$a_i = \frac{\exp(u_i)}{\sum_{j=1}^{t} \exp(u_j)},$$

and the sentence-/relation-level representation is computed as a convex combination

$$r = \sum_{i=1}^{t} a_i h_i,$$

where $t$ is the sequence (or neighborhood) length and $d$ is the feature dimension. In practice, $g(\cdot)$ can be a one-layer MLP (e.g., $g(h_i) = w^\top \tanh(W h_i + b)$) or a bilinear scorer; its parameters are learned jointly with the rest of the model. This attention layer complements the heterogeneity-aware message passing of GATv2 by re-weighting fine-grained evidence before it reaches the prediction head, thereby improving both interpretability (through $a_i$) and robustness to irrelevant features.
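A compact sketch of this scorer-softmax-pooling layer, using the one-layer MLP scorer named above (dimensions are illustrative):

```python
import torch
import torch.nn as nn

class AttentionAllocation(nn.Module):
    """Scores each element, softmax-normalizes, and returns the convex
    combination r = sum_i a_i * h_i along with the weights a."""
    def __init__(self, d):
        super().__init__()
        self.scorer = nn.Sequential(nn.Linear(d, d), nn.Tanh(),
                                    nn.Linear(d, 1))   # one-layer MLP scorer

    def forward(self, H):                  # H: (t, d) token/neighbor features
        u = self.scorer(H).squeeze(-1)     # relevance score per element
        a = torch.softmax(u, dim=0)        # attention weights (interpretable)
        r = (a.unsqueeze(-1) * H).sum(0)   # pooled representation
        return r, a

layer = AttentionAllocation(d=64)
H = torch.randn(12, 64)                    # e.g., 12 neighbors of one node
r, a = layer(H)
print(r.shape, a.sum())                    # torch.Size([64]) tensor(1., ...)
```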
4.6. Regression and Ranking-Based Optimization
The final stage of MUSIGAIN transforms high-dimensional node embeddings into scalar importance scores and optimizes them through ranking-based supervision. This module consists of two tightly coupled components: a regression head that produces continuous scores and a ranking loss that focuses on relative ordering rather than absolute values.
4.6.1. Regression Head
For each node $v$, the final embedding $h_v^{(L)}$ from the $L$-th GNN layer is mapped to a scalar score through a lightweight multi-layer perceptron (MLP):

$$z_v^{(k)} = \mathrm{ReLU}\big(W_k z_v^{(k-1)} + b_k\big), \qquad z_v^{(0)} = h_v^{(L)},$$

where $W_k$ and $b_k$ are trainable parameters and $k$ indexes the hidden layers. The final score is obtained via the following:

$$s_v = w^\top z_v^{(K)} + b,$$

with $L_2$ regularization applied to prevent overfitting: $\Omega(\Theta) = \|\Theta\|_2^2$. Scores are normalized using sigmoid: $\hat{s}_v = \mathrm{sigmoid}(s_v)$ for cross-dataset comparability.
4.6.2. Ranking-Based Loss
Rather than predicting exact robustness values, we optimize for correct relative ordering using a combination of pairwise and listwise ranking objectives. Given ground-truth robustness values $\{\Delta S(v)\}$, the pairwise ranking loss enforces margin-based ordering:

$$\mathcal{L}_{\mathrm{pair}} = \sum_{(u, v) \in P} \max\big(0,\, m - (\hat{s}_u - \hat{s}_v)\big),$$

where $P$ is the set of sampled node pairs with $\Delta S(u) > \Delta S(v)$, and $m$ is a margin hyperparameter. This encourages correctly ordered pairs to be separated by at least $m$ units.

To capture the global ranking structure, we add a listwise softmax loss that minimizes the KL divergence between the ground-truth and predicted ranking distributions:

$$\mathcal{L}_{\mathrm{list}} = \mathrm{KL}\Big(\mathrm{softmax}\big(\{\Delta S(v)\}\big) \,\big\|\, \mathrm{softmax}\big(\{\hat{s}_v\}\big)\Big).$$

An auxiliary classification loss prevents feature collapse across heterogeneous node types:

$$\mathcal{L}_{\mathrm{cls}} = -\sum_{v \in V} \log p\big(\phi(v) \mid h_v^{(L)}\big),$$

where $p\big(\phi(v) \mid h_v^{(L)}\big)$ is the predicted probability of node $v$'s true type $\phi(v)$.

The overall training objective integrates all components:

$$\mathcal{L} = \mathcal{L}_{\mathrm{pair}} + \lambda_1 \mathcal{L}_{\mathrm{list}} + \lambda_2 \mathcal{L}_{\mathrm{cls}} + \lambda_3 \|\Theta\|_2^2,$$

where $\lambda_1$, $\lambda_2$, and $\lambda_3$ balance the contributions of listwise ranking, type discrimination, and weight decay. This joint objective ensures that learned scores maintain correct relative orderings, align with global ranking distributions, and preserve semantic distinctiveness across node types.
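A sketch of this composite objective under the definitions above (the pair-sampling strategy and the type-classifier head are simplified, and weight decay is delegated to the optimizer):

```python
import torch
import torch.nn.functional as F

def musigain_loss(scores, robustness, type_logits, type_labels,
                  margin=0.5, lam1=1.0, lam2=0.1):
    """Composite ranking loss: pairwise hinge + listwise KL + type
    classification. Weight decay is handled by the optimizer."""
    # Pairwise: sample pairs via a random permutation, hinge on the margin.
    idx = torch.randperm(scores.size(0))
    sign = torch.sign(robustness - robustness[idx])    # ground-truth ordering
    pair = F.relu(margin - sign * (scores - scores[idx]))[sign != 0].mean()

    # Listwise: KL between softmax distributions of targets and predictions.
    listwise = F.kl_div(F.log_softmax(scores, dim=0),
                        F.softmax(robustness, dim=0), reduction='sum')

    # Auxiliary type classification keeps node-type features distinct.
    cls = F.cross_entropy(type_logits, type_labels)

    return pair + lam1 * listwise + lam2 * cls

scores = torch.randn(50, requires_grad=True)
robustness = torch.rand(50) * 30            # ground-truth Delta S values
type_logits = torch.randn(50, 6, requires_grad=True)
type_labels = torch.randint(0, 6, (50,))
musigain_loss(scores, robustness, type_logits, type_labels).backward()
```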
5. Experimental Setup
5.1. Datasets
We constructed four heterogeneous music knowledge graphs from Last.fm and the Million Song Dataset, covering different musical genres, to evaluate MUSIGAIN's performance across diverse application scenarios. The statistics of the four MKGs are shown in Table 1.
The POP-MKG (Pop Music) knowledge graph focuses on popular music, including pop rock, electronic pop, and modern pop artists. It contains 5823 songs, 1156 artists, 2287 albums, 2103 music relationships, and 945 genre labels, resulting in a total of 12,314 nodes and 19,567 edges. The ROCK-MKG (Rock Music) graph covers rock music methodologies, subgenres, and representative bands, with 24,612 songs and 50,127 total nodes. The JAZZ-MKG (Jazz Music) graph focuses on jazz improvisation, style evolution, and famous musicians, containing 45,328 songs and 89,743 nodes. The CLASSICAL-MKG (Classical Music) is the largest graph in our collection, with 218,956 works, 398,267 nodes, and 735,892 edges.
The construction process involved three main steps: (1) Data Collection and Entity Extraction from Last.fm API and Million Song Dataset with extensive data cleaning; (2) Music Relationship and Genre Extraction using OpenAI GPT-4 API with manual filtering by music domain experts; (3) Knowledge Graph Construction by linking all entities according to a predefined schema, including relationship types such as PerformedBy, BelongsToAlbum, HasGenre, HasEmotion, SimilarTo, and Collaborates.
Data Validation and Quality Control. To ensure the reliability of GPT-4-extracted relationships, we implemented a rigorous validation process. Three music domain experts (with 5+ years of experience in musicology) independently reviewed a random sample of 500 extracted relationships from each dataset. We measured inter-annotator agreement using Fleiss' kappa, with all four datasets reaching levels that indicate substantial agreement. Relationships with disagreement were resolved through discussion and majority voting. Additionally, we compared GPT-4 extractions against existing music ontologies (MusicBrainz, Discogs) where available, finding 89% consistency. To mitigate potential GPT-4 biases, we employed diverse prompt templates and cross-validated results across multiple API calls. The final datasets only include relationships that passed both expert validation and consistency checks, ensuring high-quality ground truth for model training and evaluation.
5.2. Evaluation Metrics
We evaluate model performance using Top-K% Node Identification Accuracy, which is defined as the fraction of correctly identified nodes in the top K% most important nodes:

$$\mathrm{Acc@}K\% = \frac{\big|\hat{V}_{K\%} \cap V^{*}_{K\%}\big|}{\big|V^{*}_{K\%}\big|},$$

where $V^{*}_{K\%}$ is the ground-truth set of the top K% most important nodes and $\hat{V}_{K\%}$ is the predicted set. We report results for $K = 5$, which evaluates whether the model correctly identifies the top 5% most important music relationship nodes. This metric directly measures the practical utility of the method for identifying music trends, as music platforms and analysts typically focus on the most important relationships rather than complete rankings.
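Under this definition, the metric reduces to a set-overlap computation, as in the following sketch:

```python
import numpy as np

def top_k_percent_accuracy(pred_scores, true_scores, k=5.0):
    """Fraction of the ground-truth top-K% nodes that also appear in the
    predicted top-K% set."""
    n = len(true_scores)
    top = max(1, int(n * k / 100))
    pred_top = set(np.argsort(pred_scores)[-top:])
    true_top = set(np.argsort(true_scores)[-top:])
    return len(pred_top & true_top) / top

rng = np.random.default_rng(0)
truth = rng.random(2000)                          # e.g., Delta S per relation node
noisy = truth + 0.05 * rng.standard_normal(2000)  # imperfect model scores
print(top_k_percent_accuracy(noisy, truth))       # close to 1.0
```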
Justification for Top-5% Threshold. The choice of 5% is motivated by both domain-specific requirements and empirical analysis. First, music industry practitioners typically focus on a small subset of highly influential relationships for trend analysis and recommendation systems—interviews with three music platform analysts revealed that they monitor approximately 3–7% of total relationships for strategic decisions, with 5% being the median. Second, our analysis of ground-truth robustness scores shows a natural gap in the distribution: the top 5% of nodes have robustness scores > 25.0, while the next tier (5–10%) has scores in the range [22.5, 25.0], indicating a meaningful distinction. Third, from a statistical perspective, 5% provides a sufficient sample size for reliable evaluation (105 nodes in POP-MKG, 432 nodes in ROCK-MKG) while focusing on the most critical nodes.
Table 2 shows MUSIGAIN’s performance across different K% thresholds. The model maintains strong performance across all thresholds (89–98%), demonstrating robustness to threshold selection. Notably, Top-1% shows slightly lower accuracy (88.57–95.24%) due to the extreme difficulty of identifying the single most critical node, while Top-10% and Top-20% show higher accuracy (92.34–97.56%) as the task becomes easier with larger sets. The 5% threshold represents a balanced choice that is neither too restrictive (like 1%) nor too lenient (like 20%), providing meaningful evaluation of the model’s ability to identify truly critical nodes. The consistent ranking of methods across different thresholds (MUSIGAIN > GATv2 > HAN > GENI for all K%) validates that our conclusions are not artifacts of threshold selection.
While metrics such as AUC-ROC and precision-recall curves are valuable for binary classification tasks, our node importance estimation problem is fundamentally a ranking task rather than a classification problem. The ground truth is a continuous importance score (graph robustness metric) rather than binary labels, making Top-K% accuracy more appropriate as it directly evaluates ranking quality. To provide additional ranking quality assessment, we also report Normalized Discounted Cumulative Gain (NDCG@K) in our ablation studies, which measures the quality of the entire ranking by considering both relevance and position. NDCG is particularly suitable for our task as it accounts for the graded relevance of nodes based on their continuous robustness scores. Additionally, we report runtime efficiency in Section 6 through scalability analysis (Table 6), demonstrating linear scaling behavior across graphs ranging from 12K to 398K nodes, which validates our scalability claims with concrete timing measurements across varying graph sizes.
5.3. Baseline Methods
We compare MUSIGAIN against the following baselines:
Degree Centrality [5]: Defines a node's importance by the number of edges incident to it, directly reflecting its immediate connectivity, local influence, and structural prominence in the network.
Betweenness Centrality [58]: Measures a node's significance by the proportion of shortest paths between node pairs that pass through it, thereby highlighting nodes that act as critical bridges for efficient information flow.
PageRank [6]: Computes node importance through a random-walk diffusion process, assigning higher scores to nodes frequently linked by other high-scoring nodes and capturing their global authority in the network.
GCNII [59]: Extends the classical GCN with residual connections and identity mapping, effectively mitigating over-smoothing, stabilizing deeper propagation, and enabling training of very deep architectures.
GraphSAGE [11]: Learns aggregation functions over sampled neighborhoods in an inductive manner, supporting efficient representation learning on large-scale graphs and enabling generalization to unseen nodes.
GIN [60]: Employs sum aggregation combined with multi-layer perceptrons (MLPs) to maximize discriminative power, achieving expressiveness equivalent to the Weisfeiler–Lehman test for graph isomorphism tasks.
GATv2 [12]: Improves upon the original GAT by relaxing weight-sharing constraints in attention computation, thereby allowing more expressive, adaptive, and flexible modeling of pairwise node interactions.
GENI [14]: Integrates structural features with textual attributes to estimate node importance, making it effective for heterogeneous graphs with rich semantic content and diverse relationship patterns.
MULTIIMPORT [15]: Leverages multiple input signals including structure, semantics, and contextual cues to infer node importance, producing more robust and comprehensive importance estimations overall.
To address concerns about heterogeneous graph methods, we adapted HAN (Heterogeneous Graph Attention Network) [26] and R-GCN (Relational Graph Convolutional Network) [46] for our node importance ranking task. While these methods were originally designed for node classification and link prediction, respectively, we modified their output layers to produce importance scores and trained them with our ranking-based loss function. This adaptation allows fair comparison under identical task settings. Additionally, we include MEIRec [61], a heterogeneous graph method for recommendation, adapted similarly for ranking. The results are presented in Table 3 and demonstrate that while these methods perform reasonably well on heterogeneous graphs, they do not match MUSIGAIN's performance, validating that our improvements stem from architectural innovations rather than simply handling heterogeneity.
5.4. Implementation Details
All experiments were conducted on a server equipped with NVIDIA GeForce RTX 3090 GPUs (24 GB memory) running Ubuntu 22.04. Models were implemented in PyTorch 1.12 with PyTorch Geometric 2.1 for graph operations. We used the Adam optimizer with a learning rate of 0.001, weight-decay regularization, and a batch size of 256. Models were trained for 200 epochs with early stopping based on validation performance (patience of 20 epochs) to prevent overfitting. The datasets were split into training (80%), validation (10%), and test (10%) sets using stratified sampling to maintain class balance across splits.
The node embedding module used 4 GATv2 layers with a hidden dimension of 256 and 8 attention heads per layer. The dynamic skipping threshold parameter $\epsilon$ was determined through systematic hyperparameter tuning on the validation set. We tested values in the range [0.1, 0.5] with increments of 0.05, evaluating Top-5% accuracy on all four datasets. The optimal value of $\epsilon$ was selected as the setting with the best average performance across all datasets (95.67% average accuracy), balancing over-smoothing prevention (lower values led to accuracy drops of 2–3% due to excessive propagation) against sufficient information aggregation (values above 0.35 caused premature stopping with a 1–2% accuracy reduction). This setting proved robust across different graph scales and was therefore used consistently for all experiments. The regression module consisted of 3 fully connected layers with dimensions [256, 128, 1] and ReLU activations between layers. The ranking loss margin $m$ was set to 0.5, and the classification loss weight $\lambda_2$ was 0.1. For each MKG, we computed approximate graph robustness scores for all music relationship nodes using a sampling-based method with 100 random node removals to estimate the change in network efficiency. The choice of 100 samples was determined through convergence analysis: we compared sample sizes of 50, 100, 200, and 500, finding that 100 samples achieved stable robustness estimates (variance < 0.01) while maintaining computational efficiency. Larger sample sizes (200+) provided negligible improvement (<0.3% accuracy gain) at 2–5× computational cost, making 100 samples the optimal trade-off for our application. This sampling approach provides supervision while avoiding the computational cost of exact eigenvalue calculations.
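One plausible reading of this sampling scheme is sketched below: network efficiency is approximated from a sample of BFS sources rather than all node pairs, and a node's supervision label is the efficiency drop its removal causes (function names and sampling details are assumptions, not the paper's exact procedure):

```python
import networkx as nx
import random

def approx_efficiency(G, n_sources=100, seed=0):
    """Approximate network efficiency via BFS from sampled source nodes
    instead of all pairs; unreachable pairs contribute zero."""
    rng = random.Random(seed)
    nodes = list(G.nodes)
    sources = rng.sample(nodes, min(n_sources, len(nodes)))
    total, pairs = 0.0, 0
    for s in sources:
        dist = nx.single_source_shortest_path_length(G, s)
        total += sum(1.0 / d for t, d in dist.items() if t != s)
        pairs += len(nodes) - 1
    return total / pairs if pairs else 0.0

def robustness_label(G, v, **kw):
    """Supervision signal: efficiency drop caused by removing node v."""
    base = approx_efficiency(G, **kw)
    H = G.copy()
    H.remove_node(v)
    return base - approx_efficiency(H, **kw)

G = nx.barabasi_albert_graph(500, 3)
hub = max(G.degree, key=lambda x: x[1])[0]
print(robustness_label(G, hub))  # hubs yield large efficiency drops
```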
6. Experimental Results
6.1. Overall Performance
The comprehensive performance evaluation across all four music knowledge graphs reveals significant advantages of the MUSIGAIN framework, as shown in
Table 3. MUSIGAIN consistently achieves superior Top-5% node identification accuracy, with performance ranging from 91.23% on ROCK-MKG to an impressive 96.78% on POP-MKG.
Traditional centrality measures exhibit fundamental limitations, with Degree Centrality achieving accuracies between 66.34% and 69.78%, while PageRank performs slightly better at 72.78% to 75.23%. The consistently poor performance highlights the inadequacy of treating heterogeneous music knowledge graphs as homogeneous structures.
Learning-based approaches show clear advantages. GATv2 stands out with remarkably high performance, achieving 94.89% on POP-MKG, though its performance degradation on CLASSICAL-MKG (88.23%) reveals scalability challenges that MUSIGAIN addresses. To ensure fair comparison regarding model capacity, we also evaluated GATv2-Deep with 6 layers (matching MUSIGAIN’s effective depth including DiGRAF and LST modules), which achieved marginal improvements of 0.23–0.44 percentage points over standard GATv2. This demonstrates that simply increasing depth provides minimal gains and can even harm performance due to over-smoothing, whereas MUSIGAIN’s adaptive mechanisms enable effective utilization of model capacity.
Heterogeneous graph methods adapted for ranking (HAN, R-GCN, MEIRec) achieve moderate performance (82.89–87.34%), outperforming general GNNs like GCN2 and GraphSAGE but falling short of GATv2 and MUSIGAIN. HAN’s meta-path-based attention achieves 87.34% on POP-MKG but struggles with larger graphs (86.45% on CLASSICAL-MKG), suggesting that predefined meta-paths may not capture all relevant structural patterns. R-GCN’s relation-specific transformations show similar limitations, while MEIRec’s knowledge graph embedding approach proves less effective for structural importance estimation. The specialized node importance methods GENI and MULTIIMPORT achieve performance comparable to the adapted heterogeneous methods (82.45–87.34%).
MUSIGAIN’s superiority becomes more pronounced as graph complexity increases, with improvements of 1.89, 1.11, 0.55, and 3.91 percentage points over standard GATv2, and 1.66, 0.89, 0.33, and 3.47 percentage points over GATv2-Deep, respectively. The largest improvement occurs on the most complex CLASSICAL-MKG dataset, suggesting that MUSIGAIN’s adaptive mechanisms are particularly beneficial for large-scale heterogeneous graphs. Compared to the best heterogeneous baseline (HAN), MUSIGAIN achieves improvements of 9.44, 6.11, 6.33, and 5.69 percentage points across the four datasets, validating that our performance gains stem from architectural innovations (dynamic skipping, DiGRAF, ranking-based optimization) rather than merely handling heterogeneity or increasing model capacity.
6.2. Ablation Studies
6.2.1. Impact of Dynamic Skipping Hyperparameter
The dynamic skipping mechanism’s effectiveness is controlled by the threshold parameter, as demonstrated in
Figure 6. At small values (0.1–0.15), excessive propagation leads to over-smoothing, with accuracy on POP-MKG dropping to 92.78%. The optimal setting achieves remarkable consistency across diverse graph scales, effectively identifying when nodes have captured sufficient neighborhood information.
When the threshold exceeds 0.35, premature stopping becomes problematic: nodes cease updating before fully incorporating relevant neighborhood information. The asymmetric response indicates that under-propagation is generally less harmful than over-smoothing for this task.
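A minimal sketch of the per-node skipping test, assuming stability is measured as the relative change of a node's embedding between consecutive layers (the exact criterion is not spelled out above):

```python
import torch

def refresh_active_mask(h_prev: torch.Tensor, h_new: torch.Tensor,
                        active: torch.Tensor, tau: float) -> torch.Tensor:
    """Deactivate nodes whose relative embedding change between consecutive
    layers falls below the threshold tau (the validation-tuned value from
    Section 5.4). The relative-change criterion is an assumption; the paper
    may use a different stability measure."""
    delta = (h_new - h_prev).norm(dim=-1) / (h_prev.norm(dim=-1) + 1e-8)
    return active & (delta >= tau)

# Inside the propagation loop, stable nodes keep their cached embeddings:
#   h_new = layer(h, edge_index)
#   h = torch.where(active.unsqueeze(-1), h_new, h)
#   active = refresh_active_mask(h_prev, h_new, active, tau)
```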
6.2.2. Effectiveness of Adaptive Activation
The DiGRAF adaptive activation function enables node-specific nonlinear transformations that capture the heterogeneous nature of music knowledge graphs.
Table 4 reveals substantial advantages of the adaptive approach across all evaluation scenarios.
DiGRAF achieves a 1.44 percentage point improvement over the best fixed activation (Sigmoid) on POP-MKG, demonstrating its ability to adapt to varied relationship types. The adaptive nature allows different transformations based on node type and local structure, proving crucial for maintaining distinct representations throughout the deep network architecture.
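DiGRAF's diffeomorphism-based formulation is not reproduced here; the following loose illustration only conveys the idea of a node-conditioned activation, where a small hypernetwork maps each node's embedding to its own slope parameters. The class name and parameterization are hypothetical.

```python
import torch
import torch.nn as nn

class NodeConditionedActivation(nn.Module):
    """Loose illustration of an adaptive activation: a small MLP maps each
    node's embedding to per-node slopes for a leaky-ReLU-like nonlinearity.
    This is NOT DiGRAF's actual diffeomorphic formulation."""
    def __init__(self, dim: int):
        super().__init__()
        self.hyper = nn.Sequential(nn.Linear(dim, dim // 4), nn.ReLU(),
                                   nn.Linear(dim // 4, 2))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # Two per-node slopes in (0, 1): one for the positive branch, one for
        # the negative branch; broadcast over the feature dimension.
        pos_slope, neg_slope = self.hyper(h).sigmoid().chunk(2, dim=-1)
        return torch.where(h >= 0, pos_slope * h, neg_slope * h)
```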
6.2.3. Component-Wise Ablation
To rigorously evaluate the contribution of each MUSIGAIN component, we conduct a progressive ablation study in which components are added incrementally to a baseline GATv2 model. The results, presented in
Table 5, show that each innovation provides meaningful and complementary improvements to overall performance.
Starting from the baseline GATv2 with 91.56% average accuracy, the addition of dynamic skipping yields a substantial 1.7 percentage point improvement. This gain reflects both the mitigation of over-smoothing and the computational efficiency that allows for deeper, more expressive architectures. The mechanism’s ability to adaptively control propagation depth proves particularly valuable for heterogeneous graphs where different node types benefit from different levels of neighborhood aggregation.
Incorporating DiGRAF adaptive activation yields an additional 1.1 percentage-point improvement, raising the average accuracy to 94.36%. The synergy between dynamic skipping and DiGRAF is noteworthy: while dynamic skipping determines when nodes should stop updating, DiGRAF optimizes how they update when active. This complementary relationship allows more efficient and effective learning than either component alone.
The final addition of the robustness-based ranking loss contributes another 1.3 percentage-point improvement, resulting in the full MUSIGAIN model’s 95.67% average accuracy and 94.12% NDCG@10. To assess statistical significance, we conducted paired t-tests across five independent runs with different random seeds. The ranking loss improvement is statistically significant (95% confidence interval: [0.9%, 1.7%], which excludes zero), confirming that the gain is not due to random variation. The NDCG@10 metric further validates the ranking quality, showing that MUSIGAIN not only identifies the correct top nodes but also orders them accurately according to their true importance. The ranking loss’s focus on relative ordering rather than absolute values proves crucial for the node importance estimation task: by directly optimizing for the correct ranking of music relationships, the model learns representations that capture structural importance rather than merely fitting numerical targets.
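The significance test amounts to a standard paired t-test over the five seeded runs; the accuracy arrays below are placeholder values for illustration, not the paper's per-seed results.

```python
import numpy as np
from scipy import stats

# Placeholder per-seed Top-5% accuracies (with vs. without the ranking loss).
with_rank    = np.array([95.7, 95.5, 95.8, 95.6, 95.8])
without_rank = np.array([94.4, 94.2, 94.5, 94.3, 94.4])

t_stat, p_value = stats.ttest_rel(with_rank, without_rank)

diff = with_rank - without_rank
ci_low, ci_high = stats.t.interval(0.95, df=len(diff) - 1,
                                   loc=diff.mean(), scale=stats.sem(diff))
print(f"t = {t_stat:.2f}, p = {p_value:.4f}, 95% CI = [{ci_low:.2f}, {ci_high:.2f}]")
```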
The consistent improvements from each component validate MUSIGAIN’s architectural design. The total improvement of 4.11 percentage points over baseline GATv2 represents not just additive gains but synergistic interactions between components. Dynamic skipping allows deeper networks that benefit more from DiGRAF’s adaptive transformations, while the ranking loss provides supervision signals that guide both mechanisms toward learning more discriminative representations.
6.3. Scalability Analysis
MUSIGAIN’s practical applicability depends critically on its ability to scale to real-world music knowledge graphs containing millions of entities. Our scalability analysis, summarized in
Table 6, shows that the framework maintains computational efficiency across graph scales spanning over an order of magnitude.
Training time scales favorably with graph size, a crucial property for practical deployment: from POP-MKG to CLASSICAL-MKG, a 32-fold increase in nodes results in only a 21-fold increase in training time. This sub-linear scaling in practice stems from the dynamic skipping mechanism, which reduces computation for stable nodes. On CLASSICAL-MKG, approximately 35% of nodes are marked as stable by layer 4, avoiding unnecessary computation in deeper layers.
Memory usage also scales favorably, with the largest CLASSICAL-MKG requiring only 18.2 GB of GPU memory. This efficiency allows training on standard research hardware without requiring specialized high-memory systems. The memory footprint remains manageable due to our sparse attention implementation and efficient caching strategy for stable node embeddings.
Compared to baseline GATv2, MUSIGAIN achieves a 30–40% reduction in computation time while delivering superior accuracy. On CLASSICAL-MKG, baseline GATv2 requires 67.3 min per epoch compared to MUSIGAIN’s 43.6 min. This efficiency gain becomes more pronounced in deeper networks, where the dynamic skipping mechanism curbs the growth of computation with depth.
The framework’s scalability extends beyond training to inference. Once trained, MUSIGAIN can evaluate new nodes in an inductive manner, processing previously unseen songs or relationships without retraining. Inference time for a single node averages 0.3 milliseconds on the GPU, enabling real-time analysis of streaming music data. This capability proves essential for music platforms that continuously incorporate new content and need to identify emerging trends quickly.
6.4. Ablation on GATv2Conv Stacking Depth and Neighborhood Size
As shown in
Figure 7, increasing the number of GATv2Conv blocks from 1 to 4 consistently improves the metrics: Precision/Recall/F1 increase by about 3.46%/1.44%/2.45%, indicating that moderate deepening strengthens feature aggregation and yields more discriminative node/edge representations. When the depth reaches 5 blocks, a slight drop in Precision and Recall appears, suggesting potential over-smoothing and noise accumulation with overly deep propagation.
As a “kernel size” analogue in graph attention, we vary the neighborhood size TopK: expanding TopK from 1 to 3–5 yields clear gains. Accuracy and Precision peak at TopK = 5, while Recall is highest around TopK = 3. This implies that a moderately larger neighborhood enriches the useful context for precise decisions, whereas excessively large neighborhoods introduce weakly relevant edges that may hurt recall. Balancing performance and cost, we use 4 GATv2Conv blocks and TopK = 3 as the default configuration in the following experiments (unless otherwise specified); a sketch of the TopK restriction follows below.
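One way to realize the TopK neighborhood restriction is to keep, for each target node, only its k highest-scoring incoming edges. How the edge scores are produced (e.g., attention weights from a previous pass) is an assumption, and the loop below favors clarity over speed.

```python
import torch

def topk_neighborhood(edge_index: torch.Tensor, edge_score: torch.Tensor,
                      num_nodes: int, k: int = 3) -> torch.Tensor:
    """Keep at most `k` highest-scoring incoming edges per target node.
    edge_index: [2, E] (source row 0, target row 1); edge_score: [E]."""
    keep = torch.zeros(edge_index.size(1), dtype=torch.bool)
    dst = edge_index[1]
    for v in range(num_nodes):
        incoming = (dst == v).nonzero(as_tuple=True)[0]
        if incoming.numel() > k:
            top = edge_score[incoming].topk(k).indices
            keep[incoming[top]] = True
        else:
            keep[incoming] = True      # fewer than k neighbors: keep all
    return edge_index[:, keep]
```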
6.5. Cross-Domain Generalization
To validate MUSIGAIN’s generalization beyond music domains, we evaluated it on three publicly available heterogeneous graph datasets from different domains: ACM (academic citation network), DBLP (bibliographic network), and IMDB (movie database). These datasets exhibit diverse structural patterns and semantic relationships distinct from music graphs.
As shown in
Table 7, MUSIGAIN achieves consistent improvements across all three domains, with gains of 1.78–3.89% over GATv2 and 1.78–3.33% over HAN. The ACM dataset features author-paper-venue relationships with citation patterns fundamentally different from music’s artist-song-album structure, yet MUSIGAIN’s adaptive mechanisms successfully capture critical nodes (influential papers and authors). On DBLP, which emphasizes co-authorship networks, MUSIGAIN identifies key researchers bridging different research communities. The IMDB dataset’s actor-movie-director relationships present yet another structural pattern, where MUSIGAIN effectively identifies influential movies and actors.
Analysis of Domain-Specific Patterns. While music graphs exhibit characteristic star structures (artist-song-album), the cross-domain datasets show different topological patterns: ACM has hierarchical citation cascades; DBLP features dense collaboration clusters; and IMDB shows bipartite actor-movie structures. To assess whether MUSIGAIN’s components are over-optimized for music patterns, we analyzed the learned DiGRAF activation parameters across domains. Interestingly, the activation patterns differ significantly: music relationship nodes learn sharp, ReLU-like activations (average Jacobian determinant 2.1), while ACM paper nodes learn smoother, sigmoid-like activations (determinant 1.4), and DBLP author nodes show intermediate behavior (determinant 1.7). This demonstrates that DiGRAF adapts to domain-specific characteristics rather than being hardcoded for music patterns. The dynamic skipping mechanism likewise exhibits domain-adaptive behavior: music graphs show 35% of nodes stabilizing by layer 4, while ACM shows 28% and DBLP 42%, reflecting different propagation requirements. These results confirm that MUSIGAIN’s core innovations—adaptive activation, dynamic skipping, and ranking-based optimization—are general-purpose mechanisms that automatically adjust to diverse graph structures rather than being over-fitted to music-specific patterns.
6.6. Case Study: Pop Music Relationships
To provide concrete insights into MUSIGAIN’s practical utility, we examine its performance in identifying critical music relationships within the POP-MKG dataset. The top 10 relationships identified by MUSIGAIN, shown in
Table 8, reveal the framework’s accuracy and ability to capture musically meaningful patterns.
MUSIGAIN correctly identifies “Emotional Resonance” as the most critical relationship in pop music, aligning perfectly with ground-truth rankings. This relationship links songs that evoke similar emotional responses, which are fundamental to pop music’s appeal and to playlist generation systems. The top five relationships exactly match the ground-truth top five, showcasing remarkable precision.
Minor discrepancies in rankings 6–10 provide insights into MUSIGAIN’s learning process. The framework ranks “Harmonic Progression” at position 6 despite its ground-truth rank of 10, suggesting recognition of underappreciated harmonic importance. The strong correlation between MUSIGAIN’s predictions and the ground-truth rankings (Spearman’s ρ = 0.89) validates the framework’s effectiveness for music industry applications.
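The rank agreement can be checked with `scipy.stats.spearmanr`; the rank lists below are toy values for illustration, not the actual rankings from Table 8.

```python
from scipy.stats import spearmanr

# Toy rank lists for illustration; the actual top-10 rankings are in Table 8.
musigain_rank     = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ground_truth_rank = [1, 2, 3, 4, 5, 9, 7, 8, 6, 10]

rho, p_value = spearmanr(musigain_rank, ground_truth_rank)
print(f"Spearman's rho = {rho:.2f} (p = {p_value:.4f})")   # rho = 0.89 here
```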
Error Analysis and Model Limitations. To understand MUSIGAIN’s limitations, we analyzed misclassified samples where the model’s predictions deviated significantly from ground truth. Among the 105 music relationships in POP-MKG, MUSIGAIN misranked 12 relationships (an 11.4% error rate). The most notable errors include: (1) “Vocal Technique” was ranked 45th against a ground-truth rank of 23rd—this underestimation occurs because vocal technique relationships have sparse connectivity in the graph despite high structural importance; (2) “Instrumentation Similarity” was overestimated at rank 18 versus a ground-truth rank of 34—the model overweights this relationship due to its high degree centrality, which does not fully reflect robustness-based importance; (3) “Cultural Context” was ranked 52nd versus a ground-truth rank of 31st—this relationship’s importance is underestimated because it bridges distant communities with few intermediate connections, making its structural role less apparent to local message passing.
Difficulty with Sparse Bridging Relationships and GATv2 Limitations. The errors above reveal a systematic limitation rooted in GATv2’s local attention mechanism. GATv2 computes attention weights over immediate neighbors only: $\alpha_{ij} = \operatorname{softmax}_{j \in \mathcal{N}(i)}\left(\mathbf{a}^{\top}\operatorname{LeakyReLU}\left(\mathbf{W}\left[\mathbf{h}_i \,\Vert\, \mathbf{h}_j\right]\right)\right)$. This local scope means that nodes with sparse connectivity (few neighbors) receive limited information aggregation, even if those few connections are structurally critical for bridging distant graph regions. For example, “Cultural Context” connects only 8 songs but bridges 4 distinct genre communities—its removal would fragment the graph significantly (high robustness impact), yet GATv2’s attention mechanism cannot “see” this global bridging role from local 1-hop or 2-hop neighborhoods.
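In code, the single-head scoring function (bias terms omitted) is roughly the following; PyTorch Geometric's `GATv2Conv` implements the same computation in optimized form.

```python
import torch
import torch.nn.functional as F

def gatv2_attention(h: torch.Tensor, edge_index: torch.Tensor,
                    W: torch.Tensor, a: torch.Tensor) -> torch.Tensor:
    """Single-head GATv2 scores e_ij = a^T LeakyReLU(W [h_i || h_j]),
    softmax-normalized over each target node's incoming edges."""
    src, dst = edge_index                                  # edges j -> i
    z = torch.cat([h[dst], h[src]], dim=-1) @ W.T          # W [h_i || h_j], shape [E, d']
    e = F.leaky_relu(z, negative_slope=0.2) @ a            # raw scores, shape [E]
    e = e - e.max()                                        # numerical stability
    num = e.exp()
    denom = torch.zeros(h.size(0), device=h.device).index_add_(0, dst, num)
    return num / denom[dst]                                # alpha_ij per edge
```

Because the softmax is taken only over each node's immediate neighborhood, an edge's global bridging role is invisible to the score, which is exactly the limitation discussed above.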
Our dynamic skipping mechanism, while effective at preventing over-smoothing, may inadvertently exacerbate this problem for sparse bridging nodes. If a bridging node has few neighbors and stabilizes early (its embedding change falls below the skipping threshold simply because little information arrives), it will skip deeper layers that could have propagated critical long-range signals. We observed that 3 of the 12 misranked relationships (25%) were marked as stable by layer 4 and skipped subsequent propagation, missing opportunities to capture their bridging importance through deeper message passing.
To quantify this effect, we analyzed the correlation between node degree and ranking error. Sparse nodes (degree < 10) show an average ranking error of 18.3 positions, while dense nodes (degree > 50) show only 7.2 positions of error. This confirms that GATv2’s local attention bias, combined with early dynamic skipping, leads to systematic underestimation of sparse yet structurally critical nodes.
Potential Solutions. Future improvements could address this limitation through: (1)
Global attention mechanisms: Incorporating transformer-style global attention [
62] or graph transformers [
63] to capture long-range dependencies beyond local neighborhoods; (2)
Higher-order graph features: Explicitly computing betweenness centrality or community bridging scores as auxiliary features to supplement local attention; (3)
Adaptive skip thresholds: Using degree-aware or betweenness-aware thresholds for dynamic skipping, allowing sparse bridging nodes to propagate deeper even if they appear locally stable; (4)
Multi-scale aggregation: Combining features from different propagation depths (e.g., layers 2, 4, and 6) rather than only the final layer, ensuring sparse nodes benefit from both local and global information. Preliminary experiments with degree-aware skip thresholds (assigning a lower threshold to sparse nodes) reduced ranking error for sparse bridging relationships by 23%, suggesting this is a promising direction for future work; a sketch of this variant follows below.
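A sketch of the degree-aware variant; the degree cutoff of 10 (matching the sparse-node analysis above) and the 0.5 scaling factor are illustrative assumptions, since the text only states that sparse nodes received a lower threshold.

```python
import torch

def degree_aware_tau(degree: torch.Tensor, tau: float,
                     sparse_cutoff: int = 10, scale: float = 0.5) -> torch.Tensor:
    """Per-node skipping thresholds: sparse nodes (degree < sparse_cutoff)
    get a reduced threshold so they remain active through deeper layers.
    The cutoff and scaling factor are illustrative, not the paper's values."""
    tau_node = torch.full(degree.shape, tau, dtype=torch.float, device=degree.device)
    tau_node[degree < sparse_cutoff] = tau * scale
    return tau_node

# Used in place of the scalar threshold in the skipping test:
#   active = active & (delta >= degree_aware_tau(degree, tau))
```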
8. Conclusions
This paper presents MUSIGAIN, an innovative GATv2-based adaptive framework for identifying key nodes in heterogeneous music knowledge graphs. By addressing computational efficiency, semantic heterogeneity, and structural importance evaluation challenges, MUSIGAIN introduces three key innovations: a dynamic skipping mechanism, a DiGRAF adaptive activation function, and ranking-based optimization.
Through extensive experiments on four real-world music knowledge graphs, MUSIGAIN demonstrated clear superiority, achieving up to 96.78% accuracy on Top-5% node identification tasks. The framework outperformed GATv2 and specialized methods, with improvements of up to 3.91 percentage points over GATv2 and 13.44 percentage points over MULTIIMPORT. Ablation studies confirmed the significant contribution of each component, with the full model achieving an average accuracy of 95.67%. MUSIGAIN also showed excellent scalability, with sub-linear training-time scaling in practice and real-time inference capabilities suitable for industrial applications.
Beyond music, MUSIGAIN’s design principles—dynamic skipping, semantic-aware transformations, and robustness-supervised ranking—provide a robust framework for heterogeneous graphs in academic knowledge graphs, supply chain networks, and biomedical applications. MUSIGAIN offers a versatile and efficient solution for graph-based tasks across multiple domains.