1. Introduction
Graph data, as a prevalent form of non-Euclidean data, are widely present across numerous real-world domains, including transportation networks [
1], citation networks [
2], biological protein interactions [
3], and knowledge graphs [
4]. To extract valuable insights from these complex graph datasets, graph representation learning has emerged. Its main objective is to map nodes, subgraphs, or entire graphs onto low-dimensional embedding spaces while maximally preserving their topological structure, node attributes, and rich semantic information [
5]. The learned low-dimensional vector representations serve as effective features that significantly empower downstream machine learning tasks such as node classification [
6], link prediction [
7], and graph classification [
8]. Consequently, graph representation learning has become one of the key technologies in graph data mining [
9].
With recent advances in deep learning, graph neural networks (GNNs) [
10] have achieved significant success in graph representation learning. GNNs extend convolutional operations from regular, grid-like domains to arbitrary topologies and unordered structures, and are commonly divided into spatial methods [
11,
12] and spectral methods [
13,
14]. Currently, most GNNs are essentially flat, as they propagate node information solely through edges and obtain graph representations by globally summarizing node representations. These aggregation methods include averaging across all nodes [
15], adding virtual nodes [
16], or using fully connected layers [
17] or convolutional layers [
18]. However, many real-world graph objects (such as molecules and proteins) are composed of reusable, functionally specific substructures (e.g., functional groups and structural motifs) [
19], which can be regarded as factor subgraphs. Existing methods often couple this factor information into holistic, unstructured graph-level representations during the learning process, leading to limitations in model interpretability, generalization ability, and sensitivity to discriminative substructures. Without understanding the principles behind predictions, these black-box models cannot be fully trusted or widely applied in critical fields such as medical diagnosis. Furthermore, model interpretation facilitates model debugging and error analysis.
To enhance interpretability, researchers have begun exploring subgraph-based graph neural networks. These methods can explain node or graph classification predictions from GNNs trained using different strategies, broadly falling into two categories. On one hand, some methods enhance model interpretability by integrating graph kernels with graph neural networks [
20,
21,
22]. This is achieved by comparing a set of learnable hidden subgraphs with the input graph to obtain the final graph embedding representation. However, the size and number of hidden subgraphs require manual tuning, and they cannot detect key subgraphs of arbitrary size and shape without explicit subgraph-level annotations. On the other hand, some approaches integrate subgraph information into the message-passing mechanism of graph neural networks [
23,
24,
25], with the main objective of mining critical subgraphs reflecting intrinsic graph properties. These methods rely on optimization decision theory to ensure interpretability of generated subgraphs [
26]. However, they typically rely on information from only a single type of subgraph. Because a graph’s properties may be influenced by more than one subgraph, the information in the graph cannot simply be compressed into a single subgraph. For example, the properties of a chemical molecule may be determined by multiple functional groups [
27].
Although existing subgraph-based neural networks demonstrate strong performance and enhance the interpretability of GNNs, current approaches primarily suffer from two main limitations:
- (1)
Insufficient Structural Disentanglement Capability: Node features learned by mainstream GNNs and graph autoencoders are typically holistic vectors, making it difficult to explicitly decompose and identify latent substructures or factors with distinct semantics within the graph (e.g., representing different functional groups in molecular graphs or distinct community patterns in social networks). This coupling hinders the model’s deep understanding of the underlying principles of graph composition.
- (2)
Lack of Prototyping and Generalization Capabilities: Many graph classification models directly classify whole-graph features without mining reusable “prototype subgraph structures” shared across graphs. This hinders models from capturing the most discriminative structural patterns at the category level, resulting in weak generalization and interpretability when data are scarce or when dealing with heterogeneous graphs.
To address these challenges, this paper proposes the Prototype Subgraph Disentangled Graph Neural Network (PSDGNN), a graph representation learning framework based on latent factor decomposition and prototype alignment. Our key idea is to explicitly disentangle learned node features into multiple independent latent factors and align these factors with a set of learnable, discriminative prototype subgraph structures. Specifically, the model first acquires node features via a graph autoencoder and then partitions these features along the feature dimension into several independent latent factor groups. Each factor group is transformed into a factor subgraph through a subgraph generator, explicitly modeling latent internal substructures. Furthermore, we introduce a set of learnable prototype subgraphs, each representing a fundamental structural pattern shared across graphs. We design a one-to-one factor–prototype correspondence and similarity matching mechanism, supplemented by a factor mutual information objective, to encourage learned factors to align structurally and semantically with their most similar prototypes. The main contributions of this work can be summarized as follows:
- (1)
We propose a factor subgraph disentanglement mechanism that explicitly models and decomposes latent semantic substructures within graphs by partitioning node features into multiple groups and constructing factor subgraphs, thereby enhancing model interpretability.
- (2)
We design a prototype alignment and consistency learning paradigm. By introducing learnable prototype subgraphs and aligning decomposed factors with discriminative prototype subgraphs through similarity matching and factor subgraph mutual information loss, we learn category-sensitive, generalizable subgraph structures.
- (3)
Experiments on multiple public benchmark datasets demonstrate that our model not only achieves superior classification performance compared with existing baseline methods but also provides intuitive, structured explanations for classification decisions through the visual disentanglement of factor subgraphs and prototype subgraphs.
Recent advances in large language model (LLM)-based graph reasoning and graph-aware intelligent agents have shown emerging potential in complex graph data analysis. These methods typically rely on the accurate identification and representation of semantic substructures within graphs. However, existing LLM- and agent-based approaches still face limitations in structural interpretability and in the efficient modeling of prototype structures that can be shared across graphs. The factor subgraph disentanglement and prototype alignment mechanism proposed in this paper is expected to provide a structured and interpretable subgraph representation paradigm for these directions, thereby complementing the graph understanding capabilities of large models and intelligent agents in terms of interpretability and structural generalization.
This paper is organized as follows: In
Section 2, we review some work related to subgraph neural networks and the combination of graph kernels and GNNs.
Section 3 introduces some implementation details regarding PSDGNN. In
Section 4, we conduct a series of experiments to evaluate the performance of PSDGNN. Finally, we conclude the paper in
Section 5.
3. Methodology
The overall architecture of the proposed PSDGNN framework is shown in
Figure 1.
Problem Formulation. A graph can be represented as $G = (V, X, A)$, where $V$ is a set of nodes and $X \in \mathbb{R}^{N \times d}$ denotes the node feature matrix. The number of nodes is denoted by $N$, and the dimension of the node features is $d$. $A \in \{0, 1\}^{N \times N}$ denotes the adjacency matrix of the graph. The graphs studied in this paper are all unweighted undirected graphs. For the supervised graph classification task, we are given a dataset $\{(G_i, y_i)\}$, where $y_i$ denotes the label of $G_i$, and the objective is to learn a mapping function from the graph $G_i$ to the label $y_i$. This paper employs one-hot encoding to process discrete labels. As an illustration, three class labels correspond to the three-dimensional vectors $[1, 0, 0]$, $[0, 1, 0]$, and $[0, 0, 1]$, respectively. In addition, we summarize the main notations in
Table 2.
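The one-hot label encoding described above can be illustrated with a minimal sketch. It assumes that class labels are zero-based integer indices; the function name `one_hot` is hypothetical and is not taken from the PSDGNN implementation.

```python
def one_hot(label_index, num_classes):
    """Return a one-hot list for a zero-based class index (assumed convention)."""
    if not 0 <= label_index < num_classes:
        raise ValueError("label_index must lie in [0, num_classes)")
    return [1 if k == label_index else 0 for k in range(num_classes)]


# Three classes, encoded as three-dimensional vectors.
labels = [0, 1, 2]
encoded = [one_hot(y, 3) for y in labels]
print(encoded)
```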
3.1. Representation Disentangled Module
Given a graph
, we use a graph variational autoencoder (GraphVAE) to learn
m disentangled latent factors
, where
,
and
denotes the dimension of the
l-th hidden layer. These latent factors are obtained by grouping the dimensions of the node features. Our improved GraphVAE employs a basic graph convolutional network (GCN) as the graph encoder, for which the output of the
l-th layer is given by
where
A denotes the normalized adjacency matrix;
, where
I is the identity matrix;
D is the diagonal matrix of node degrees; and
is the Sigmoid activation function.
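As a rough illustration of the encoder layer, the sketch below implements one GCN propagation step with a Sigmoid activation. It assumes the standard symmetric normalization with self-loops, in which the degree matrix is computed from the adjacency matrix plus the identity; the helper names and the toy weights are hypothetical. Plain Python lists are used so that the example runs without dependencies, whereas a practical implementation would use a tensor library.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def normalize_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2}, with D the degree matrix of A + I (assumed form)."""
    n = len(adj)
    a_tilde = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_tilde]
    return [[a_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gcn_layer(a_hat, h, w):
    """One propagation step: sigmoid(A_hat @ H @ W)."""
    z = matmul(matmul(a_hat, h), w)
    return [[sigmoid(v) for v in row] for row in z]


# Toy path graph 0 - 1 - 2 with two-dimensional node features.
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w = [[0.5, -0.5], [0.25, 0.75]]  # illustrative weights, not learned
a_hat = normalize_adjacency(adj)
h1 = gcn_layer(a_hat, x, w)
```

Stacking several calls to `gcn_layer`, each with its own weight matrix, gives a multi-layer encoder.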
In the decoder of the improved GraphVAE, we employ separate heads: a multi-layer perceptron for reconstructing
X and a linear inner-product decoder for recovering
A. Specifically, we define the graph reconstruction as
where
is the reconstructed adjacency matrix and
denotes the reconstructed node features. Our multi-head GraphVAE aims to minimize reconstruction error while maximizing the compression of latent variables
Z. The objective is formulated as follows:
where
denotes the Frobenius norm and
represents the graph encoder. Furthermore, the latent factors should be mutually independent. The mutual information (MI) between two factor representations reaches its minimum value of 0 when
, indicating that
and
are mutually independent. Thus, minimizing the MI between them encourages factor representations to learn information from different aspects of the graph. Recently, several MI upper bounds [
37] have been introduced to minimize MI. However, estimating these MI upper bounds across
m graph representations requires
estimations, leading to significantly higher computational costs, especially when
m is large. To mitigate this issue, we relax MI minimization to an orthogonality constraint, since orthogonality is a specific instance of linear independence. This relaxation has also proven effective in many previous studies [
38]. The factor independence loss is defined as follows:
where
denotes the L1 norm and
I represents the identity matrix. The complete algorithm of the representation disentangling module is shown in Algorithm 1.
| Algorithm 1 Representation disentangling module of PSDGNN |
- 1: Input: Graph, number of latent factors m, number of encoder layers L
- 2: Output: Latent factors, reconstructed adjacency matrix, reconstructed node features
- 3: Compute the normalized adjacency matrix
- 4: Initialize the encoder input with the node features X
- 5: for l = 1 to L do
- 6: GCN encoding using Equation (1)
- 7: end for
- 8: Set Z to the output of the last encoder layer
- 9: Split Z into m equal parts along the feature dimension:
- 10: for i = 1 to m do
- 11: Assign the i-th part of Z to the i-th latent factor ▹ i-th latent factor
- 12: end for
- 13: Reconstruct the adjacency matrix using Equation (2)
- 14: Reconstruct the node features using Equation (3)
- 15: Compute the GraphVAE loss using Equation (4)
- 16: Compute the factor independence loss using Equation (5)
|
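The splitting, reconstruction, and independence steps of Algorithm 1 can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the independence penalty is assumed to be the L1 norm of the Gram matrix of sum-pooled, L2-normalized factor vectors minus the identity, and all function names are hypothetical.

```python
import math


def split_factors(z, m):
    """Split each node embedding into m equal contiguous chunks (one per factor)."""
    d = len(z[0])
    if d % m != 0:
        raise ValueError("feature dimension must be divisible by m")
    size = d // m
    return [[row[k * size:(k + 1) * size] for row in z] for k in range(m)]


def inner_product_decoder(z):
    """Edge probabilities sigmoid(Z Z^T) from node embeddings."""
    return [[1.0 / (1.0 + math.exp(-sum(a * b for a, b in zip(zi, zj)))) for zj in z]
            for zi in z]


def pooled_unit_vector(factor):
    """Sum-pool the node vectors of one factor and L2-normalize (assumed readout)."""
    pooled = [sum(col) for col in zip(*factor)]
    norm = math.sqrt(sum(v * v for v in pooled)) or 1.0
    return [v / norm for v in pooled]


def orthogonality_penalty(factors):
    """Assumed form of the independence loss: L1 norm of (G G^T - I)."""
    g = [pooled_unit_vector(f) for f in factors]
    total = 0.0
    for i, gi in enumerate(g):
        for j, gj in enumerate(g):
            gram = sum(a * b for a, b in zip(gi, gj))
            total += abs(gram - (1.0 if i == j else 0.0))
    return total


z = [[1.0, 0.0, 0.0, 1.0], [2.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 3.0]]
factors = split_factors(z, m=2)
a_rec = inner_product_decoder(z)
penalty = orthogonality_penalty(factors)
```

In this toy input the two pooled factor vectors are orthogonal, so the penalty vanishes; identical factors would instead be penalized through the off-diagonal Gram entries.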
3.2. Factor Subgraph Generation Module
The workflow of the factor subgraph generation module is illustrated in
Figure 2. Based on the previously obtained latent factors, we compute a selection probability for each edge in the factor subgraph. First, we compute the inner product of the node representations within each latent factor, where each entry represents an edge selection score. Then, using the Sigmoid function, we obtain the attention mask
for the
m-th factor subgraph, ensuring that
, where
denotes the probability of selecting the edge between node
and node
.
is calculated as follows:
Subsequently, we binarize
to obtain the edge assignment
. To ensure that the gradient
is computable, we employ the Gumbel–Softmax reparameterization trick [
39] to update the edge assignment. However, if we directly apply the Gumbel–Softmax function to
, we can only select an edge from every
n consecutive edges. To ensure that a sufficient number of edges is selected, we reshape
into an
L-dimensional matrix and then apply binarization. We utilize the Gumbel–Softmax method to generate the edge assignment
. This guarantees that at least one edge is selected from every
L consecutive edges. Here, the
l-th edge sample probability of the
L-dimensional sample vector is defined as
where
is the temperature of the concrete distribution,
denotes the edge selection probability for the
l-th group after partitioning,
is the sample probability, and
is generated from the
distribution. Thus, based on this process,
L determines the size of each potential subgraph
, and we can retain
edges. Then, we transform
into an
matrix. Finally, we extract the factor subgraph
via
, whose node feature is
, where ⊙ denotes element-wise multiplication.
A and
represent the adjacency matrices of the input graph and the
m-th factor subgraph, respectively. This yields the entire set of potential factor subgraphs
.
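The grouped edge selection described above can be sketched as follows. The sketch assumes that edge probabilities are flattened into a list, partitioned into groups of L consecutive candidates, and that one edge per group is kept by taking the largest Gumbel-Softmax sample; the function names, the default temperature, and the clamping constants are hypothetical. During training, a straight-through estimator would typically be needed to pass gradients through the hard selection, which this sketch omits.

```python
import math
import random


def gumbel_softmax_group(probs, tau, rng):
    """Relaxed one-hot sample over one group of edge selection probabilities."""
    logits = []
    for p in probs:
        p = min(max(p, 1e-9), 1.0)                       # avoid log(0)
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)    # uniform noise in (0, 1)
        gumbel = -math.log(-math.log(u))                  # Gumbel(0, 1) sample
        logits.append((math.log(p) + gumbel) / tau)
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def select_edges(edge_probs, group_size, tau=0.5, seed=0):
    """Keep exactly one edge from every `group_size` consecutive candidates."""
    rng = random.Random(seed)
    mask = []
    for start in range(0, len(edge_probs), group_size):
        group = edge_probs[start:start + group_size]
        soft = gumbel_softmax_group(group, tau, rng)
        winner = soft.index(max(soft))
        mask.extend(1 if k == winner else 0 for k in range(len(group)))
    return mask


edge_probs = [0.9, 0.1, 0.2, 0.8, 0.5, 0.5]
mask = select_edges(edge_probs, group_size=3)
```

With six candidate edges and a group size of 3, the resulting mask keeps two edges, which shows how the group size controls the size of the extracted subgraph.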
3.3. Prototype Kernel Embedding Module
Recent prototype-based GNNs represent prototype subgraphs only implicitly, as a set of learnable graph embedding vectors, which compromises their interpretability. We therefore define prototype subgraphs explicitly, using a random walk graph kernel to explore graph topology. Our prototype kernel embedding module introduces learnable prototype subgraphs parameterized by trainable adjacency matrices. Formally, the module comprises a set of prototype subgraphs of equal size, one for each factor subgraph. Each prototype subgraph is a small undirected graph with few parameters and corresponds one-to-one, by index, to a previously obtained factor subgraph. These prototype subgraphs are expected to learn prototype structures that help distinguish the classes. Because random walk kernels quantify the similarity between two graphs by the number of walks they have in common, we compare each factor subgraph with its corresponding prototype subgraph using differentiable functions derived from random walk kernels.
Specifically, given a factor subgraph
and its corresponding prototype subgraph
, their direct product graph
has an adjacency matrix
, where ⊗ denotes the Kronecker product between two matrices. It can be observed that a random walk on
can be interpreted as a simultaneous walk on graphs
and
[
40]. Considering that the traditional random walk kernel computes all matched walk pairs between two graphs, the number of matched walks when traversing nodes on both
and
is equivalent to the number of matched walks in the adjacency matrix of
, denoted
of
, when traversing nodes on
and
simultaneously. Therefore, the random walk kernel for
P steps between
and
P that computes all simultaneous random walks is defined as
where
denotes the one-dimensional vector resulting from unfolding the similarity matrix of node features between
and
into a vector.
denotes a positive weight. To simplify computation, we count only the common walks of length exactly
P on the two graphs being compared:
Then, given a set of factor subgraphs
and a set of prototype subgraphs
, we calculate the similarity between each factor subgraph and its prototype subgraph according to the subscript correspondence:
Unlike traditional graph kernels where both graphs are fixed, our scenario requires the prototype graph to be learnable. The kernel computes a joint similarity that integrates two complementary types:
- (1)
Walk-based topological similarity. Captured by the direct product graph adjacency matrix , where is learnable. This term measures structural alignment by counting common walk sequences of length P.
- (2)
Feature-guided similarity. Encoded by , where is also learnable. This term weights each walk by the similarity of node features between the two graphs.
The final graph embedding is obtained by concatenating these similarities and is then passed to a multi-layer perceptron to perform graph classification.
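The P-step random walk kernel between a factor subgraph and a prototype subgraph can be sketched as follows. The sketch assumes the common form in which the flattened node-feature similarity vector is multiplied P times by the adjacency matrix of the direct product graph and then contracted with itself; the function names are hypothetical, and the prototype adjacency and features, which are learnable in PSDGNN, are fixed toy values here.

```python
def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [[a[i][j] * b[k][l]
             for j in range(len(a[0])) for l in range(len(b[0]))]
            for i in range(len(a)) for k in range(len(b))]


def feature_similarity(x_g, x_p):
    """Flattened similarity matrix: entry (i, k) is the dot product of x_g[i] and x_p[k]."""
    return [sum(u * v for u, v in zip(xi, pk)) for xi in x_g for pk in x_p]


def walk_kernel(adj_g, x_g, adj_p, x_p, steps):
    """Assumed P-step kernel: s^T (A_g kron A_p)^P s, with s the feature similarity vector."""
    a_prod = kron(adj_g, adj_p)
    s = feature_similarity(x_g, x_p)
    v = list(s)
    for _ in range(steps):
        v = [sum(w * t for w, t in zip(row, v)) for row in a_prod]
    return sum(a * b for a, b in zip(s, v))


edge = [[0, 1], [1, 0]]
path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
ones2 = [[1.0], [1.0]]
ones3 = [[1.0], [1.0], [1.0]]

k_edge_edge = walk_kernel(edge, ones2, edge, ones2, steps=3)
k_path_edge = walk_kernel(path, ones3, edge, ones2, steps=1)
```

Computing one such kernel value per factor and prototype pair and concatenating the results gives a vector with one entry per factor, which can then be fed to a classifier.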
3.4. Model Optimization
The final component of PSDGNN takes the similarity representation of factor subgraphs and prototype subgraphs as input and outputs the predicted probability for each class:
, where the function
f is an MLP followed by a Softmax layer. For optimization, cross-entropy loss is employed to measure classification loss:
where
N denotes the number of graphs and
represents the true label.
Combined with the previous GraphVAE loss and factor mutual information loss, the final optimization objective of PSDGNN is
where
and
are hyperparameters that adjust the weights of the loss function. The complete algorithm of PSDGNN is shown in Algorithm 2.
| Algorithm 2 Training process of PSDGNN |
Input: graph dataset with labels; number of factor subgraphs m; number of layers l; number of training epochs
Output: Predicted label of the graph
- 1: Initialize the adjacency matrix and feature matrix of each prototype subgraph.
- 2: Encode node features using Equation (1) to obtain the node representations
- 3: Decompose the node representations into m latent factors along the feature dimension
- 4: Reconstruct the graph using Equations (2) and (3).
- 5: for each training epoch do
- 6: for each factor subgraph do
- 7: Calculate the edge mask matrix using Equation (6)
- 8: Reshape the edge mask matrix into an L-dimensional matrix
- 9: Select edges using Equations (7) and (8)
- 10: Generate the factor subgraph by masking the input adjacency matrix ▹ Generate factor subgraph via element-wise masking
- 11: Calculate the similarity between the factor subgraph and the prototype subgraph using Equation (11) ▹ Concatenate similarities into graph embedding
- 12: end for
- 13: Generate the predicted label for the graph using the MLP and Softmax.
- 14: Calculate the overall loss using Equation (13)
- 15: Update parameters by gradient descent on the overall loss ▹ End-to-end optimization via backpropagation
- 16: end for
|
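The final objective combines the classification loss with the two auxiliary losses. The sketch below assumes a simple weighted sum in which the cross-entropy term has unit weight; the names `weight_vae` and `weight_factor` stand in for the two weighting hyperparameters and are hypothetical, as are the numeric values.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, one_hot_labels):
    """Mean cross-entropy over graphs, given class probabilities and one-hot labels."""
    total = 0.0
    for p, y in zip(probs, one_hot_labels):
        total -= sum(t * math.log(max(q, 1e-12)) for q, t in zip(p, y))
    return total / len(probs)


def total_loss(ce, vae_loss, factor_loss, weight_vae, weight_factor):
    """Assumed combination: classification loss plus weighted auxiliary losses."""
    return ce + weight_vae * vae_loss + weight_factor * factor_loss


probs = [softmax([2.0, 0.5]), softmax([0.1, 1.2])]
labels = [[1, 0], [0, 1]]
ce = cross_entropy(probs, labels)
loss = total_loss(ce, vae_loss=2.0, factor_loss=3.0, weight_vae=0.5, weight_factor=0.1)
```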
3.5. Complexity Analysis
In this section, we will analyze the computational complexity and consumption of PSDGNN. The overall time complexity and memory complexity of one graph of PSDGNN are respectively and , where N denotes the number of nodes, is the average node degree, d is the hidden dimension and m is the number of factors. Specifically, the time complexity and memory complexity of the representation disentangled module are and . The factor subgraph generation module takes and in terms of time complexity and memory complexity, respectively. For the prototype kernel embedding module, time complexity and memory complexity are and .
3.6. Discussion
Identifiability of latent factor decomposition. A limitation of the proposed factor decomposition is that without additional constraints, the decomposition of node features into m latent factors is not strictly identifiable. In general, any invertible transformation applied to the latent factors can produce an equivalent reconstruction, a common challenge in disentangled representation learning. To address this from a practical perspective, PSDGNN incorporates two mechanisms that promote factor uniqueness and disentanglement:
- (1)
Factor mutual information loss. Following prior work on mutual information-based disentanglement [
37,
38], we minimize the mutual information among different factor subgraphs. This encourages each factor to capture distinct, non-redundant structural patterns, reducing the degrees of freedom in the decomposition space and promoting a form of practical identifiability.
- (2)
One-to-one prototype alignment. Each factor subgraph is forced to align with a dedicated learnable prototype subgraph through the random walk kernel similarity. This additional supervision signal anchors each factor to a specific structural prototype, further constraining the decomposition space. While the prototype–factor mapping is one-to-one, overlapping substructures in the original graph are not discarded. Instead, they are captured through the composition of multiple prototypes in the later factor aggregation stage. For example, in a molecular graph with overlapping functional groups (e.g., a benzene ring fused with a pyrrole ring), different prototypes activate for different subregions, and the factor representation aggregates these prototype activations to form a holistic representation. The one-to-one constraint applies to the mapping definition, not to the activation pattern—a single node or edge can contribute to multiple prototypes, and multiple prototypes can jointly represent overlapping semantics.
4. Experiments
In this section, we first describe the experimental setup and present graph classification experiments conducted on seven real-world benchmark datasets to evaluate the performance of the proposed method. Next, we explore the stability and convergence of model training. Subsequently, ablation experiments are performed to demonstrate the effectiveness of the various modules in our framework, and parameter sensitivity experiments are carried out to empirically examine the influence of different hyperparameters. Finally, we present a visualization analysis to investigate PSDGNN’s encoding variability and interpretability.
4.1. Experiments on Benchmark Datasets
Datasets. We use seven public benchmark datasets that are among the most commonly used in graph classification and are widely recognized as standard evaluation benchmarks in the community. They differ considerably in graph size, node feature richness, and number of classes, which allows PSDGNN’s generalization ability and robustness to be evaluated across different scenarios. The datasets cover three representative domains (chemical compounds, biological proteins, and social networks), as detailed in
Table 3. For the chemical compound datasets, MUTAG [
19], NCI1 [
41], and PTC [
42] were selected. MUTAG contains aromatic and heteroaromatic nitro compounds labeled as mutagenic or non-mutagenic. NCI1 screens compounds for activity against non-small-cell lung cancer, while PTC predicts carcinogenicity in rodents. For biological protein datasets, PROTEINS [
43] and DD [
44] were selected; in both, the task is to determine whether a protein is an enzyme. For the social network datasets, IMDB-B [
45] and IMDB-M [
45] were selected. The IMDB datasets originate from the Internet Movie Database and represent movie collaboration networks. The task is to classify each actor collaboration graph by movie genre.
Baseline. We select several representative baseline methods for graph classification, covering graph kernels, graph neural networks, and hybrid approaches combining both, for comparison with PSDGNN. Traditional graph kernel methods include WL kernel [
46], RetGK [
47], AEGK [
48], and DSP-I [
49], while graph neural network models include GIN [
10], SLIM [
30], AdaSNN [
25], SEKGIN [
50], and HA-SCN [
51]. Recently proposed hybrid approaches are KerGNN [
22], GKNN [
36], and GCKSVM [
52]. To ensure experimental fairness, comparisons are conducted under identical hardware configurations and default parameter settings.
Parameter settings. The learning rate is set to 0.01 by default, with a batch size of 32 graphs and 400 iterations. Classification accuracy is estimated with 10-fold cross-validation. The number of factor subgraphs is set to 16, the path length for the random walk kernel is 3, and graph pooling defaults to summation. For the social network datasets, node degree is used as the node feature.
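For reference, a 10-fold cross-validation split of the kind used for evaluation can be sketched as follows. The sketch assumes a plain shuffled, non-stratified split with a fixed seed; the function name and the dataset size of 188 graphs are illustrative only, and the splitting protocol actually used in the experiments may differ.

```python
import random


def k_fold_indices(n_items, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for a shuffled k-fold split."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # fold sizes differ by at most one
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(188, k=10, seed=0))
```

The reported accuracy would then be the mean, and usually the standard deviation, of the test accuracies over the ten folds.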
Experimental environment. All experiments in this paper are conducted using an NVIDIA RTX 4090D GPU with 24 GB of VRAM. The operating system is Ubuntu 16.04.1 LTS 64-bit, and the CPU model is Xeon® Platinum 8481C. The code is implemented in Python 3.10 with PyTorch 2.2.1 as the deep learning framework.
To evaluate the performance of PSDGNN, its graph classification accuracy is compared with that of the baseline methods across seven graph datasets. The results are presented in
Table 4, where the best result on each dataset is highlighted in bold, the second-best result is underlined, and “-” indicates that results for that method could not be obtained on that dataset. As shown in
Table 4, PSDGNN demonstrates significant performance advantages on six datasets, the exception being MUTAG. Compared with graph kernel-based methods (e.g., WL, RetGK, and AEGK), PSDGNN achieves notably higher accuracy, confirming its robust subgraph structure modeling capability. SEKGIN, a subgraph-based graph neural network, achieves the second-best overall performance among all baseline methods, underscoring the importance of subgraph encoding. We attribute the advantage of PSDGNN over SEKGIN to its use of community subgraph information, whereas SEKGIN relies on fixed neighborhood subgraphs. Compared with hybrid methods such as GKNN, KerGNN, and GCKSVM, PSDGNN also demonstrates superior classification accuracy. It achieves average accuracy improvements of 8% and 10% on MUTAG and PTC, respectively, indicating that incorporating community subgraph information into graph embedding representations can enhance the performance of graph neural networks.
Figure 3 shows the results of PSDGNN across ten-fold cross-validation on seven datasets.
4.2. Ablation Analysis
Module Ablation: To investigate the impact of individual modules on performance, we conducted ablation studies on the components of PSDGNN. Specifically, we compared the original model with three variants: w/o GraphVAE (PSDGNN without the modified GraphVAE), w/o PG (PSDGNN without the prototype subgraph module and prototype kernel embedding module), and GNN (PSDGNN without the modified GraphVAE, with the factor subgraph generation module and the prototype kernel embedding module). Notably, the w/o GraphVAE variant omits the decoding component and model optimization of the enhanced GraphVAE algorithm. As shown in
Figure 4a, our ablation study reveals that PSDGNN consistently outperforms the model w/o PG across all datasets, which suggests that prototype subgraphs help models capture typical features relevant to classification. Furthermore, PSDGNN outperforms the model w/o GraphVAE. This may be because GraphVAE is capable of learning meaningful, interpretable latent representations of graphs.
Loss Ablation: To investigate the impact of the individual loss functions on performance, a second ablation study was conducted. Specifically, we compared the original model with three variants: w/o GraphVAE Loss (removing the GraphVAE loss function), w/o Factor Loss (removing the factor mutual information loss function), and GNN (PSDGNN without GraphVAE modifications, retaining the original subgraph module and the original kernel embedding module). The results of the ablation study are shown in
Figure 4b. Compared with the original model (PSDGNN), the classification accuracy of the variant w/o GraphVAE Loss decreased by approximately 2%. This is because GraphVAE loss helps preserve critical structural information in the input graph, preventing information loss. Compared with the original model (PSDGNN), the version w/o Factor Loss resulted in a classification accuracy decrease of approximately 1%. This is because factor loss enforces independence between different latent variables, ensuring that each factor subgraph corresponds to a unique and non-redundant connectivity pattern.
Effect of Independence Constraints on Graph Classification: In
Table 5, we evaluate three representative paradigms for feature independence constraints: HSIC (kernel-based), distance correlation (distance-based), and factor mutual information (Factor MI) (information-theoretic). Factor MI consistently achieves the best or runner-up performance across all seven datasets, demonstrating the superiority of information-theoretic independence constraints for graph-structured data. HSIC performs competitively on MUTAG (92.9%) and PTC (73.4%), ranking second in most cases, validating kernel-based methods as strong baselines, while distance correlation yields relatively conservative results, suggesting that distance-based independence measures may be less sensitive to structural dependencies in graphs. Based on these results, we adopt Factor MI as the default independence constraint. Although alternative disentanglement strategies such as contrastive objectives (e.g., InfoNCE) or independence-promoting regularizers (e.g., Total Correlation penalty) show potential, the consistent and dominant performance of Factor MI across the seven benchmarks justifies its selection. Exploring these alternatives in graph representation learning remains an interesting direction for future work.
Comparison with Attention-Based Soft Matching. To validate the design choice of strict prototype alignment, we implemented an alternative approach in which each factor graph is matched with all prototype graphs via an attention mechanism. As shown in
Figure 5, strict alignment consistently outperforms the attention mechanism variant across all seven datasets, with an average improvement of 2–4%. We attribute this to two key reasons. First, strict alignment maintains a clear one-to-one correspondence between factor subgraphs and prototype subgraphs, which is crucial to decoupling—each factor is forced to align with a distinct structural prototype, preventing the attention mechanism from blending multiple prototypes into a single factor, thereby preserving interpretability at the factor level. Second, attention-based variants introduce additional parameters and optimization complexity, which may lead to overfitting, particularly on smaller datasets such as MUTAG and PTC, where the performance gap is most pronounced. Therefore, we retain strict prototype alignment as the default mechanism in our framework.
4.3. Parameter Sensitivity Analysis
Number of Factor Subgraphs M: When constructing factor subgraphs, the hyperparameter
M is introduced to control the number of feature groupings for nodes. To achieve optimal model performance, experiments were conducted with different values of
M. As shown in
Figure 6, graph classification experiments were performed with
M ranging from 2 to 12 in increments of 2. Experimental results indicate the following: On the MUTAG, PROTEINS, IMDB-B, and IMDB-M datasets, classification accuracy peaks when
M is approximately 4, while on the PTC and NCI1 datasets, the highest accuracy is achieved when M is approximately 6. In summary, the optimal range for the parameter is between 4 and 8. This is primarily because the features of nodes in different datasets contain varying amounts of information; an excessively large value of
M introduces noise and weakens the discriminative power of the factor subgraph, while an excessively small value of
M prevents the model from fully capturing factor subgraphs at different levels, thereby reducing the representational capacity of the factor subgraph.
Number of Graph Encoder Layers: To investigate the impact of encoder layer count on experimental performance, we examined how graph classification accuracy evolves across training epochs for PSDGNN’s GraphVAE module with varying graph encoder layers across six datasets. As shown in
Figure 7, with the number of layers ranging from 1 to 4, the best performance on all datasets was achieved with two layers. When the number of layers exceeded two, graph classification accuracy began to decline. This follows from how message passing interacts with the graph structure: when there are too many layers, the information aggregated from the neighborhoods of individual nodes becomes too similar, which reduces classification accuracy. This phenomenon, commonly observed in message-passing neural networks, is known as the oversmoothing problem. However, the factor subgraph mutual information loss gives the model a degree of resistance to oversmoothing. Consequently, although accuracy declines as the number of layers increases, the decline is less pronounced.
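The oversmoothing effect described above can be illustrated with a toy example. The sketch below uses plain mean aggregation over each node and its neighbors, which is an assumption made for illustration and not the PSDGNN encoder; the path graph and scalar features are hypothetical.

```python
def mean_aggregate(features, neighbors):
    """One round of mean aggregation over each node and its neighbors."""
    updated = []
    for node, nbrs in enumerate(neighbors):
        group = [node] + nbrs
        updated.append(sum(features[j] for j in group) / len(group))
    return updated

def spread(features):
    """Gap between the largest and smallest node feature."""
    return max(features) - min(features)

# Hypothetical 5-node path graph with scalar node features.
neighbors = [[1], [0, 2], [1, 3], [2, 4], [3]]
features = [0.0, 1.0, 2.0, 3.0, 4.0]

spreads = [spread(features)]
for _ in range(4):  # four stacked aggregation layers
    features = mean_aggregate(features, neighbors)
    spreads.append(spread(features))
```

The spread shrinks with every additional layer, meaning node representations become harder to tell apart, which is the behavior the layer-count experiment probes.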
Weighting Hyperparameters in Loss Functions: During model optimization, this paper introduces two weighting hyperparameters, one for the GraphVAE loss and one for the factor mutual information loss. This experiment investigates how varying these hyperparameters affects classification accuracy across three graph classification datasets. As shown in
Figure 8, on the MUTAG dataset, the highest classification accuracy is achieved when
and
. On the DD dataset, the highest classification accuracy is achieved when
and
. On the Pubmed dataset, the highest classification accuracy is achieved when
the GraphVAE loss weight is set to 0.2 and the factor mutual information loss weight is set to 0.6. This indicates that the two loss functions provide PSDGNN with self-supervised information, enabling the subgraph representations to distinguish between different graph instances. Through these two loss functions, the proposed subgraph disentanglement mechanism can extract different key information from various types of datasets.
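As an illustration of how the two weights enter the objective, the sketch below combines a task loss with the two auxiliary losses as a weighted sum and enumerates candidate weight pairs. The presence of a separate task loss, the additive form, the grid step of 0.2, and all names and loss values are assumptions for illustration; only the weight pair (0.2, 0.6) appears in the text above.

```python
from itertools import product

def total_loss(task_loss, vae_loss, mi_loss, vae_weight, mi_weight):
    """Weighted sum of an assumed task loss and the two auxiliary losses."""
    return task_loss + vae_weight * vae_loss + mi_weight * mi_loss

def weight_grid(step=0.2, upper=1.0):
    """All (vae_weight, mi_weight) pairs on a regular grid in (0, upper]."""
    count = round(upper / step)
    values = [round(step * (i + 1), 10) for i in range(count)]
    return list(product(values, values))

grid = weight_grid()

# Hypothetical loss values, combined with the weight pair reported for Pubmed.
example = total_loss(task_loss=0.50, vae_loss=1.20, mi_loss=0.30,
                     vae_weight=0.2, mi_weight=0.6)
```

A sensitivity study of this kind would train one model per pair in `grid` and compare validation accuracy.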
The Path Length P of the Random Walk Kernel: For the prototype kernel embedding module, we utilize random walk kernels to compute the similarity between factor subgraphs and prototype subgraphs. To investigate the impact of path length
P for different random walk kernels on model performance, we evaluated PSDGNN with varying path lengths
P across six datasets. As shown in
Figure 9, on the MUTAG, PTC, PROTEINS, and NCI1 datasets, PSDGNN performs best when
or
, whereas on the IMDB-B and IMDB-M datasets, PSDGNN performs best when
. This suggests that the optimal path length is larger in social networks than in bioinformatics graphs. This is reasonable because the diameters of most basic functional blocks in molecules are around 2 to 3, whereas long-range dependencies also play an important role in social networks.
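To show what the path length P controls, the following is a minimal sketch of a P-step random walk kernel on unlabeled graphs, computed through the direct product graph. It is a textbook simplification under stated assumptions: the kernel used in PSDGNN operates on factor subgraphs and learnable prototypes with node features, which this sketch omits, and the function names and example graphs are hypothetical.

```python
def product_adjacency(a, b):
    """Adjacency of the direct product graph (Kronecker product of a and b)."""
    n, m = len(a), len(b)
    return [[a[i][k] * b[j][l] for k in range(n) for l in range(m)]
            for i in range(n) for j in range(m)]

def random_walk_kernel(a, b, path_length):
    """Count pairs of matching walks with 0..path_length steps in both graphs."""
    ax = product_adjacency(a, b)
    size = len(ax)
    walks = [1] * size  # zero-step walks starting at each product node
    total = sum(walks)
    for _ in range(path_length):
        # Extend every walk by one step in the product graph.
        walks = [sum(ax[i][j] * walks[j] for j in range(size))
                 for i in range(size)]
        total += sum(walks)
    return total

# Hypothetical inputs: a triangle and a 3-node path.
triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
kernel_values = [random_walk_kernel(triangle, path, p) for p in range(3)]
```

Increasing the path length adds longer common walks to the similarity, so larger P lets the kernel compare structure over a wider neighborhood.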
The Size of the Prototype Subgraph: In our framework, each learnable prototype subgraph is parameterized by a trainable adjacency matrix and a trainable feature matrix, with the prototype subgraph size matching that of the factor subgraph. To investigate the impact of different prototype subgraph sizes on model performance, we conducted experiments with the prototype subgraph size ranging from 2 to 16. As shown in
Figure 10, the optimal prototype size varies substantially across datasets, reflecting differences in graph complexity and structural diversity. For small molecular graphs such as MUTAG and PTC, which have average node counts of 17 and 19 respectively, the best performance is achieved with a prototype size of 4. Smaller prototypes (e.g., size 2) lack sufficient capacity to capture discriminative functional groups such as nitro groups or aromatic rings, while larger prototypes (e.g., size 8 or above) introduce spurious structures that lead to overfitting. For medium-sized graphs including PROTEINS and IMDB-B, which have average node counts of 20 to 25, the optimal prototype size increases to 6. This larger capacity allows the model to capture more complex structural motifs, such as secondary protein structures or dense collaboration communities. For the largest datasets in our study (DD, NCI1, and IMDB-M), where average node counts range from 30 to nearly 70, the optimal prototype size ranges from 8 to 10. These larger prototypes are necessary to encode the diverse and elaborate topological patterns present in large molecular compounds, protein enzymes, and multi-genre movie collaboration networks. Notably, when the prototype size is increased beyond 12, performance degrades consistently across all datasets due to increased model complexity and sparsity in the similarity matching between factor subgraphs and prototypes.
4.4. Visualization Analysis
To assess the discriminative power of PSDGNN, we use t-SNE to project the high-dimensional graph embeddings into 2D space. As shown in
Figure 11, the embeddings of different classes form well-separated clusters on MUTAG, PROTEINS, and DD, indicating that PSDGNN successfully learns class-discriminative representations.
To illustrate the disentanglement process more clearly, we present an example of the factor subgraph generated by PSDGNN.
Figure 12 shows the original disentangled factor subgraph. After a coefficient threshold is applied to the original graph, the edges of each disentangled factor subgraph are highlighted in different colors. For instance, in the MUTAG dataset, the task is to predict the mutagenicity of nitroaromatic compounds on Salmonella typhimurium. We can observe that different parts of the molecular graph play distinct roles in the prediction. This also demonstrates the reliability of the generated factor subgraphs in producing a disentangled graph representation.
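The thresholding step used for this visualization can be sketched as follows. This is an illustrative assumption about the procedure, not the authors' plotting code: each edge is assumed to carry one coefficient per factor, and an edge is highlighted for a factor when its coefficient exceeds the threshold. The function name, coefficients, and threshold value are hypothetical.

```python
def factor_edges(edge_coeffs, threshold):
    """Group edges by factor, keeping an edge where its coefficient exceeds threshold."""
    num_factors = len(next(iter(edge_coeffs.values())))
    groups = [[] for _ in range(num_factors)]
    for edge, coeffs in edge_coeffs.items():
        for factor, coeff in enumerate(coeffs):
            if coeff > threshold:
                groups[factor].append(edge)
    return groups

# Hypothetical per-edge coefficients for a 4-edge graph and 2 factors.
edge_coeffs = {
    (0, 1): [0.9, 0.1],
    (1, 2): [0.7, 0.3],
    (2, 3): [0.2, 0.8],
    (3, 0): [0.4, 0.6],
}
highlighted = factor_edges(edge_coeffs, threshold=0.5)
```

Each entry of `highlighted` lists the edges that would be drawn in one color, so raising the threshold yields sparser and more selective factor subgraphs.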
5. Conclusions
This paper proposes the Prototype Subgraph Disentangled Graph Neural Network (PSDGNN), which enhances interpretability by decomposing input graph features into latent factor subgraphs and aligning them with learnable prototype subgraphs via random walk kernels. In addition, it introduces a factor mutual information loss to encourage learning distinct latent connectivity patterns across factor subgraphs, thereby improving model generalization.
Experimental results on seven public benchmark datasets demonstrate that PSDGNN consistently outperforms existing methods, achieving competitive performance across all datasets. Notably, PSDGNN achieves a 5.5% accuracy improvement over the best baseline on DD and a 4.2% improvement on IMDB-B, validating the effectiveness of its disentanglement and prototype alignment mechanisms. On average, PSDGNN outperforms traditional graph kernels by 8–10% and GNN-based methods by 3–5% on challenging datasets. Beyond accuracy, PSDGNN provides structured interpretability through visualizable factor subgraphs and prototype subgraphs, and effectively mitigates the oversmoothing problem in GNNs.
Future work will focus on exploring composition methods based on other graph kernel approaches and investigating the model’s potential applications in other graph tasks, such as node classification and link prediction. In addition, the structured factor subgraphs disentangled by our method have the potential to provide interpretable substructure priors for large-model-based graph reasoning tasks. Exploring the integration of factor subgraphs with large models is a promising direction for future work.