1. Introduction
Graph theory provides a fundamental framework for the mathematical modeling of discrete structures representing entities and their relationships. A graph G = (V, E) consists of a set of nodes V and a set of edges E, which represent binary relationships between the nodes. Traditional graph theory studies have primarily focused on homogeneous graphs, where all nodes and edges are of the same type [
1]. However, many real-world problems require heterogeneous graph structures, that is, systems containing multiple node types and differentiated edge relationships. Heterogeneous graphs, also known as multi-relational or typed graphs, extend the classical definition of a graph by assigning type information to nodes and edges [
2]. In these structures, each node and edge belongs to a predefined set of types and is associated with type-mapping functions. The combinatorial properties of heterogeneous graphs, such as the chromatic number, independence number, cross-partition matchings, and connectivity, are fundamentally different from those of homogeneous graphs [
3]. K-partite heterogeneous graphs exhibit a special structure in which the node set is partitioned into disjoint sets, with edges connecting only nodes across these sets [
4]. Such graphs can be used to naturally model social networks, knowledge graphs, and multi-level hierarchical organizations (such as student–school–country or patient–hospital–region). While similar multi-level structures appear in diverse fields, including healthcare and organizational analysis, our empirical findings should not be interpreted as directly transferable to domains whose graph topologies differ fundamentally from the sparse, low-diameter tripartite structure examined in this study. The methodological framework remains broadly applicable, but architectural conclusions are topology dependent. For example, recent work on g-good-neighbor diagnosability in multiprocessor networks highlights how fault propagation patterns depend on hierarchical graph topology [
5,
6]. Hierarchical stochastic network models have also been proposed to assess fault behavior in multi-level industrial processes [
7].
Analyzing heterogeneous graph structures gives rise to various combinatorial optimization problems. For example, problems such as node attribute prediction in multi-level structures, heterogeneous neighborhood aggregation, and modeling inter-partition information flow are too complex to be addressed by classical graph algorithms. These problems require significant computational resources, especially in large-scale graph structures, and even sub-problems such as hyperparameter selection can be inefficient without systematic optimization strategies. Therefore, developing computationally efficient algorithms for heterogeneous graph structures is an important area of research in graph theory. However, despite growing interest in graph-based learning, the integration of optimization strategies with heterogeneous and multi-level graph representations remains largely unexplored. This represents a critical methodological gap that this study seeks to address.
In recent years, the integration of graph theory with computational methods has marked the beginning of a new era in theoretical and applied research. Graph machine learning (GML) focuses on developing and applying computational algorithms based on graph structures. Unlike classical tabular data methods, this field exploits the combinatorial properties of graphs to uncover hidden structural patterns by modeling the relational dependencies between nodes. Graph neural networks (GNNs) have become one of the most important tools in graph-based computational methods. Based on message passing and state updating mechanisms, GNNs model the flow of information between nodes [
8]. These mechanisms enable each node to update its representation by integrating information from neighboring nodes, as well as its own features. This facilitates learning at the node, edge, and graph levels. From a graph theory perspective, GNNs can be viewed as recursive functions defined on the graph. Architectural choices, such as neighborhood aggregation strategies and the number of message-passing layers, influence the model’s ability to capture structural properties. For instance, the number of layers determines the receptive field size (K-hop neighborhood), and different aggregator functions (e.g., mean, maximum, or attention-based) may respond differently to varying degree distributions. In heterogeneous, multi-level graphs, these architectural decisions are particularly critical, as information must propagate across partitions with distinct structural characteristics. Various GNN architectures have been developed in the literature, each with distinct theoretical foundations and computational characteristics. The Graph Convolutional Network (GCN) aggregates neighbor information via spectral-based convolution filters, leveraging the spectral properties of the graph Laplacian matrix [
9]. The Graph Attention Network (GAT) employs an attention mechanism to dynamically assign importance weights to neighboring nodes, enabling adaptive aggregation based on learned attention coefficients [
10]. The Graph Isomorphism Network (GIN) achieves high expressive power by implementing aggregation functions equivalent to the Weisfeiler–Lehman graph isomorphism test. This enables the model to distinguish between diverse graph structures [
11]. GraphSAGE addresses the challenge of scalability in large-scale graphs by sampling neighborhoods, thus enabling inductive learning on previously unseen nodes [
12]. Theoretically, these architectures can be extended to heterogeneous, multi-level graphs by incorporating type-specific parameters for different nodes and edges. However, optimizing these models—including hyperparameters such as embedding dimensions, learning rates, and the number of message-passing layers—remains problem-dependent and requires systematic optimization strategies. Despite the theoretical extensibility of GNN architectures to heterogeneous settings, most existing studies have focused on homogeneous, single-level graphs. Systematic investigations of the performance of heterogeneous GNNs on multi-level relational structures are scarce, as is the integration of such GNNs with principled hyperparameter optimization frameworks, such as Bayesian optimization. Multi-level heterogeneous graphs introduce substantial combinatorial complexity due to differentiated node types, asymmetric inter-partition connections, and partition-specific structural properties. Such structures are prevalent in education (student–school–country), healthcare (patient–hospital–region), and the social sciences, where relationships inherently span multiple hierarchical levels. Consequently, the development of Bayesian-optimized heterogeneous GNN frameworks for multi-level data represents a significant gap in both graph theory and computational methods. Recent years have witnessed the development of several heterogeneous GNN architectures. Zhang et al. [
13] developed the HetGNN model to learn node representations in heterogeneous graphs, and encoded content interactions by sampling heterogeneous neighbors. Yang et al. [
14] proposed the SeHGNN model to simplify overly complex attention mechanisms and provided wider neighborhood interactions by using long meta paths. Zhu et al. [
15] presented the RSHN model to jointly learn node and edge representations in heterogeneous graphs and captured edge relationships with a coarsened line graph approach. In addition to these developments, Zhu et al. [
16] introduced a high-order topology-enhanced graph convolutional network that incorporates multiscale structural information to improve representation learning on dynamic graphs. Their formulation demonstrates how higher-order and multi-level topological dependencies can be leveraged to strengthen message passing, providing insights relevant to graph partitioning and multiscale graph theory. The performance of computational methods on heterogeneous graph structures is highly sensitive to hyperparameter selection. Parameters such as the learning rate, the number of layers, the embedding dimension, and the aggregator type can have a direct impact on the model’s generalizability. Classical grid-search or random search methods are computationally inefficient in high-dimensional parameter spaces. Bayesian optimization offers a systematic solution to this problem based on probabilistic modeling. Acquisition function-based approaches (e.g., Expected Improvement and Upper Confidence Bound) enable convergence to optimal parameters with a limited number of trials [
17,
18]. Although this method is widely used in machine learning literature, it has rarely been applied to graph-based computational problems, particularly multi-level heterogeneous graph structures. This study aims to investigate the theoretical properties of heterogeneous, multilevel graph structures, and to develop computational algorithms supported by Bayesian optimization on these structures. The TIMSS 2023 eighth-grade mathematics dataset was selected as a case study because it naturally represents a three-partite heterogeneous graph consisting of student, school, and country levels. While only a limited number of studies have combined Bayesian optimization methods with GNN architectures [
16,
19], these works have predominantly focused on engineering and bioinformatics applications rather than on multi-level or educational data modeling. The multilevel hierarchical structure inherent in the educational data (e.g., students, schools, countries) reveals unique combinatorial properties when considered from the perspective of heterogeneous graph theory. In particular, the modeling of information flow between partitions in a tripartite graph structure, and its sensitivity to hyperparameter selection, has not been systematically investigated in the literature. Accordingly, the methodological contributions of this study focus on the following: (i) theoretical formulation of a tripartite graph structure consisting of student, school, and country levels, analyzing the edge structures and graph parameters across partitions; (ii) systematic comparison of GraphSAGE, GCN, GAT, and GIN architectures on heterogeneous structures and examining the performance characteristics of each with respect to structural features; and (iii) application of acquisition function-based Bayesian optimization to the hyperparameter search problem of graph-based computational algorithms and demonstration of its effectiveness compared to classical methods. This integrated approach provides a methodological framework for the theoretical modeling of heterogeneous multilevel graph structures, the design of computational algorithms on these structures, and the integration of Bayesian optimization methods into graph problems. The educational data serves as a concrete example to demonstrate the applicability of this framework. This integration bridges the theoretical foundations of heterogeneous graph theory with probabilistic optimization, establishing a unified framework for multi-level relational modeling.
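As a minimal illustration of the message-passing and receptive-field ideas discussed above, the following sketch uses toy scalar features and no learned weights (all names are illustrative, not part of the study's implementation); it shows that a node's state is influenced by its K-hop neighborhood only after K aggregation rounds:

```python
# Toy mean-aggregation message passing: scalar node states, no learned
# weights. Each layer averages a node's state with its neighbors' states,
# so a signal travels exactly one hop per layer.

def message_passing_layer(h, adj):
    """One aggregation round: h'[v] = mean of h[v] and its neighbors' states."""
    return {v: sum([h[v]] + [h[u] for u in adj[v]]) / (1 + len(adj[v]))
            for v in adj}

# Path graph a - b - c: node c holds the only nonzero signal.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
h0 = {"a": 0.0, "b": 0.0, "c": 1.0}

h1 = message_passing_layer(h0, adj)   # 1 layer: c's signal reaches b, not a
h2 = message_passing_layer(h1, adj)   # 2 layers: the signal now reaches a
```

In a trained GNN the averaging would be interleaved with learned transformations and nonlinearities, but the hop-per-layer propagation behavior is the same.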
4. Results
The TIMSS 2023 eighth-grade mathematics achievement data was subjected to a comprehensive modeling process that considered its multilevel structure. Three modeling approaches were evaluated: baseline models (linear regression, ridge regression, Random Forest, XGBoost, SVR, and CatBoost), heterogeneous GNN models (GCN, GAT, GIN, and GraphSAGE), and GNN models optimized with Bayesian optimization. All models were applied to the same dataset, and their predictive performance was compared using the R
2, RMSE, and MAE metrics. In the second stage, the model with the best performance was analyzed using the GNNExplainer method to improve interpretability. Baseline models were constructed using student-level demographic and home environment variables, as well as school- and country-level contextual variables (as fixed effects). The data were randomly split at the student level into 70% training data and 30% test data. Missing values were imputed using the median; categorical variables were dummy-coded; and numerical variables were z-score standardized. To prevent data leakage, all preprocessing steps were fitted only on the training data, with the fitted transformations then applied to the test data. To account for the multistage sampling design of TIMSS, the total sampling weight (TOTWGT) was used as the case weight in all analyses. Hyperparameter tuning was performed using systematic grid search and 3-fold cross-validation.
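The leakage-free preprocessing discipline described above (fit imputation and standardization statistics on the training split only, then apply them unchanged to the test split) can be sketched as follows; the helper functions are illustrative, not the study's actual pipeline:

```python
import math

def fit_preprocessor(train_col):
    """Learn median (imputation) and mean/std (z-scoring) from training data only."""
    observed = sorted(x for x in train_col if x is not None)
    n = len(observed)
    median = (observed[n // 2] if n % 2 == 1
              else (observed[n // 2 - 1] + observed[n // 2]) / 2)
    imputed = [x if x is not None else median for x in train_col]
    mean = sum(imputed) / len(imputed)
    std = math.sqrt(sum((x - mean) ** 2 for x in imputed) / len(imputed))
    return median, mean, std

def transform(col, median, mean, std):
    """Apply the training-set statistics to any split (train or test alike)."""
    return [((x if x is not None else median) - mean) / std for x in col]

train = [10.0, None, 30.0, 20.0]
test = [None, 40.0]
median, mean, std = fit_preprocessor(train)   # fitted on the training split only
train_z = transform(train, median, mean, std)
test_z = transform(test, median, mean, std)   # no test-set statistics leak in
```

The key point is that `transform` never recomputes statistics: the test split is standardized with the training split's median, mean, and standard deviation.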
Table 2 shows the optimized hyperparameters of the baseline models.
Different configurations were tested in terms of model depth and embedding size when training heterogeneous GNN models (GCN, GAT, GIN and GraphSAGE). Initially, an embedding size of H = 192 and two- to three-layer structures were employed. This value produced consistent results in preliminary experiments and was found to align with settings commonly employed in GNN literature [
12]. Dropout regularization was applied to all GNN models to prevent overfitting, and the Adam optimization algorithm [
46] was used. During model training, an early stopping strategy based on the RMSE of the validation set was employed to enhance generalization. Then, the optimal hyperparameters for each GNN model were determined using Bayesian optimization. A systematic search was performed for the learning rate, dropout rate, number of layers, embedding size (H), and weight decay parameters, using a Gaussian process-based expected improvement (EI) acquisition function. The best hyperparameter combinations for each model were obtained by minimizing the root mean square error (RMSE) on the validation set.
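As a self-contained sketch of the Gaussian process/expected improvement search described above, the following minimizes a one-dimensional toy objective standing in for validation RMSE. The actual study searched five hyperparameters jointly; the kernel length scale, grid, and function names here are illustrative assumptions:

```python
import numpy as np
from math import erf, sqrt, pi

def rbf(a, b, ls=0.3):
    """Squared-exponential (RBF) kernel matrix between 1-D point sets."""
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / ls) ** 2)

def gp_posterior(X, y, Xs, noise=1e-6):
    """GP posterior mean and standard deviation at query points Xs."""
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(X, Xs)
    Kinv = np.linalg.inv(K)
    mu = Ks.T @ Kinv @ y
    var = np.diag(rbf(Xs, Xs) - Ks.T @ Kinv @ Ks)
    return mu, np.sqrt(np.clip(var, 1e-12, None))

def expected_improvement(mu, sigma, best):
    """EI for minimization: E[max(best - f, 0)] under the posterior."""
    z = (best - mu) / sigma
    cdf = 0.5 * (1.0 + np.vectorize(erf)(z / sqrt(2.0)))
    pdf = np.exp(-0.5 * z ** 2) / sqrt(2.0 * pi)
    return (best - mu) * cdf + sigma * pdf

def val_rmse(x):
    """Toy stand-in for validation RMSE as a function of one hyperparameter."""
    return (x - 0.65) ** 2 + 0.1

grid = np.linspace(0.0, 1.0, 201)       # candidate hyperparameter values
X = np.array([0.1, 0.5, 0.9])           # initial design points
y = np.array([val_rmse(x) for x in X])

for _ in range(10):                     # BO loop: fit GP, maximize EI, evaluate
    mu, sigma = gp_posterior(X, y, grid)
    x_next = grid[np.argmax(expected_improvement(mu, sigma, y.min()))]
    X = np.append(X, x_next)
    y = np.append(y, val_rmse(x_next))

best_x = X[np.argmin(y)]                # converges near the true minimizer 0.65
```

EI balances exploitation (low posterior mean) against exploration (high posterior uncertainty), which is why it typically needs far fewer objective evaluations than grid or random search.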
Table 3 shows the optimal hyperparameters obtained through Bayesian optimization. Notably, the embedding size (H) values varied between 71 and 192 across models, suggesting that different GNN architectures require distinct representation capacities. In particular, smaller embedding sizes were found for the GCN_Bayesian and GraphSAGE_Bayesian models compared to the H = 192 value used in preliminary experiments. This reduction suggests that jointly optimizing the hyperparameters can yield a more compact representation with better generalization performance [
17].
As shown in
Table 3, the optimal number of layers for all heterogeneous GNN models was found to be two. This result has a rigorous graph-theoretical justification: in the tripartite student–school–country graph, where edges connect students to schools and schools to countries, every student node lies within distance 2 of its country node in the underlying undirected graph. Thus, any student node can reach country-level information in exactly two hops. Since a K-layer GNN aggregates information from the K-hop neighborhood under standard message passing [44], two layers are both necessary (to reach the full 2-hop receptive field) and sufficient (deeper layers do not expand the receptive field relevant to student nodes but instead cause over-smoothing [45,47]). Deeper architectures (three to four layers) exhibited lower validation performance, confirming the over-smoothing effect, whereby repeated aggregation reduces representational discriminability.
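The two-hop argument can be checked mechanically with breadth-first search on a toy student–school–country graph (node names are illustrative):

```python
from collections import deque

def hops_from(source, adj):
    """Breadth-first search distances (hop counts) from one node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for u in adj[v]:
            if u not in dist:
                dist[u] = dist[v] + 1
                queue.append(u)
    return dist

# Toy hierarchy: students attach to schools, schools to countries
# (no intra-partition edges, mirroring the structure described above).
edges = [("s1", "schoolA"), ("s2", "schoolA"), ("s3", "schoolB"),
         ("schoolA", "countryX"), ("schoolB", "countryX")]
adj = {}
for a, b in edges:                      # build an undirected adjacency list
    adj.setdefault(a, []).append(b)
    adj.setdefault(b, []).append(a)

dist = hops_from("s1", adj)             # country info is exactly 2 hops away
```

Note that country information sits exactly two hops from every student, while student-to-student paths across schools are longer; additional layers therefore add no new country-level context for student nodes.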
For each architecture, the cost of a single message-passing layer depends on the number of nodes |V|, the number of edges |E|, and the embedding dimension H. Using sparse adjacency matrices, the per-layer time complexity of GCN, GIN, and GraphSAGE with mean aggregation is O(|E|·H + |V|·H²), where the dominant cost arises from sparse matrix–vector multiplication along edges and linear transformations at nodes. In contrast, GAT requires computing attention coefficients and weighted messages for every edge; with K attention heads, the per-layer complexity becomes O(K·(|E|·H + |V|·H²)), which is typically more expensive in both computation and memory.
In the tripartite TIMSS graph used in this study, edges exist only between students, schools, and countries (no intra-partition edges). As a result, |E| grows approximately linearly with |V|, and node degrees remain bounded. This sparse structure enables nearly linear scaling for aggregation-based architectures (GCN, GIN, GraphSAGE), whereas attention-based GAT incurs additional overhead due to edge-level attention computations.
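A back-of-envelope sketch of these per-layer costs under the stated graph sizes (H = 96 and 4 attention heads are illustrative values, not the study's tuned settings):

```python
def per_layer_cost(num_nodes, num_edges, H, heads=1):
    """Approximate multiply-accumulate count for one message-passing layer:
    aggregation touches every edge once (|E|*H) and the node-wise linear
    transform costs |V|*H^2; attention repeats both per head."""
    return heads * (num_edges * H + num_nodes * H * H)

# Tripartite TIMSS-like sizes: each student has one school edge and each
# school one country edge, so |E| grows linearly with |V|.
V = 10_000 + 789 + 25
E = 10_000 + 789
cost_mean = per_layer_cost(V, E, H=96)           # GCN / GIN / GraphSAGE style
cost_gat = per_layer_cost(V, E, H=96, heads=4)   # attention with 4 heads
```

Because |E| ≈ |V| here, the |V|·H² transform term dominates, and the multi-head factor multiplies the whole per-layer budget.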
Table 4 shows a comparison of the performance of the test sets for the baseline models, the heterogeneous GNN models and the Bayesian-optimized GNN models. All performance metrics were calculated by applying the TIMSS sampling weights (TOTWGT).
As shown in
Table 4, linear methods (Linear Regression and Ridge Regression) exhibited relatively high error metrics, with R
2 values around 0.52–0.53 and RMSE values exceeding 80. In contrast, ensemble methods (XGBoost, CatBoost, and LightGBM) outperformed these models, with XGBoost achieving the best baseline performance (R
2 = 0.5786, RMSE = 74.87).
When examining the performance of heterogeneous GNN models before Bayesian optimization, significant differences are observed across architectures. The GCN model (R2 = 0.5153, RMSE = 86.66) performed worse than the baseline methods, falling behind even linear regression models. This weak performance stems from the spectral convolution mechanism failing to adapt to the tripartite graph structure with the initial hyperparameter configurations, particularly the embedding dimension H = 192. The subsequent determination of the optimal embedding size for GCN as H = 71 through Bayesian optimization confirms that the initial configuration was unsuitable for this architecture. In contrast, GAT (R2 = 0.5593, RMSE = 83.35) and GIN (R2 = 0.5646, RMSE = 82.94) models demonstrated substantially better performance than linear baseline models (R2 ≈ 0.52–0.53), with accuracy levels approaching ensemble methods. These results represent acceptable performance for educational data in the literature. Most notably, the GraphSAGE model exhibited strong performance even without Bayesian optimization (R2 = 0.5798, RMSE = 74.85, MAE = 58.24), outperforming all baseline methods and achieving results nearly identical to XGBoost. This exceptional robustness stems from GraphSAGE’s architectural design, which is inherently well-suited to hierarchical graph structures. The fixed-size neighborhood sampling mechanism manages information flow across school nodes with heterogeneous student enrollments (in-degree: 1–60, mean: 12.29), while the concatenation-based aggregation strategy creates representations more resilient to hyperparameter misspecification compared to spectral convolution or attention mechanisms. Significant performance improvements were observed in all heterogeneous GNN models following Bayesian optimization.
The improvement was particularly dramatic for GCN (R2 = 0.5153 → 0.5845, RMSE = 86.66 → 71.38), demonstrating the critical importance of hyperparameter optimization for convolution-based architectures. GAT and GIN models achieved similar gains, with error metrics converging to RMSE ≈ 71–72 and surpassing all baseline methods.
The GraphSAGE_Bayesian model achieved the highest explanatory power among all models (R2 = 0.6187, RMSE = 71.72, MAE = 64.32), surpassing the best baseline model (XGBoost: R2 = 0.5786) and demonstrating the superior modeling capacity of graph-based representations on multi-level educational data. The strong performance of GraphSAGE was further enhanced through Bayesian optimization, indicating that systematic hyperparameter search can improve performance beyond the architecture’s inherent robustness.
The Bayesian optimization process identified architecture-specific optimal hyperparameters tailored to the tripartite structure (
Table 3). Notably, the two-layer architecture achieved the highest accuracy across all models, consistent with the tripartite graph’s characteristics: two message-passing layers provide sufficient information flow for student nodes to access contextual information at school and country levels. Deeper architectures (3–4 layers) caused over-smoothing problems on the validation set and exhibited lower performance.
These findings demonstrate that architecture selection and hyperparameter optimization must be considered jointly in heterogeneous GNN models. While robust architectures like GraphSAGE perform well initially, spectral methods like GCN reach their potential only through systematic optimization. Consequently, heterogeneous graph representations combined with Bayesian optimization outperform fixed-effects models on multi-level educational data.
The higher explained variance of GraphSAGE_Bayesian is directly related to how the graph structure represents multi-level relationships. Traditional regression models explain student achievement through variable coefficients and represent school and country contexts as fixed effects, capturing complex relationships only to a limited extent. The GraphSAGE model, conversely, incorporates school- and country-level contextual information alongside individual student characteristics into the graph structure via learnable embeddings. However, while GraphSAGE provides strong predictive performance, it does not directly indicate which features are most critical. To address this, the best-performing GraphSAGE_Bayesian model was analyzed using GNNExplainer.
GNNExplainer calculated the relative importance of each variable using weighted gradient-saliency scores. Since these scores are derived from the model’s gradient sensitivities, importance should be evaluated based on variable rankings and cumulative contributions [
38].
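The gradient-saliency idea can be illustrated with a toy stand-in for the trained model: importance is the magnitude of the prediction's sensitivity to each input feature, approximated here by central finite differences (GNNExplainer operates on the actual GNN's gradients; this sketch is only conceptual, and all names are illustrative):

```python
def saliency(predict, x, eps=1e-4):
    """Per-feature gradient-magnitude saliency via central finite differences:
    larger |d prediction / d feature| means larger attributed importance."""
    scores = []
    for j in range(len(x)):
        up, down = list(x), list(x)
        up[j] += eps
        down[j] -= eps
        scores.append(abs(predict(up) - predict(down)) / (2 * eps))
    return scores

# Toy stand-in for the trained model's prediction for one student:
# feature 0 (e.g., books at home) is far more influential than feature 1.
def predict(x):
    return 3.0 * x[0] + 0.2 * x[1]

scores = saliency(predict, [1.0, 1.0])
ranking = sorted(range(len(scores)), key=lambda j: -scores[j])
```

As in the analysis above, what matters is the resulting ranking and the cumulative contribution of the top-ranked features, not the raw score values.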
Figure 2 shows the top 20 student-level variables with the highest contributions across all students.
The findings indicate that the most influential determinants of student achievement are home learning resources, parental education level, and access to digital learning opportunities. Key indicators include availability of personal study space (own room and desk), access to technology (computer, smartphone, internet), number of books at home, and parental education level—all markers of socioeconomic status. Internet use for educational purposes showed strong associations with achievement, as did motivational and psychosocial factors such as valuing mathematics achievement, discussing environmental issues, and perceiving school safety.
While GNNExplainer analyzed only student nodes, school- and country-level contextual information was incorporated indirectly through GraphSAGE’s message-passing mechanism. Thus, the explainer does not isolate the importance of school or country nodes explicitly, but these contextual effects are reflected implicitly in the learned student representations. This explains why two students with similar socioeconomic characteristics may receive different predictions depending on the school and country to which they are connected. Whereas traditional methods encode school and country contexts as fixed dummy variables, GraphSAGE represents them as learnable embeddings interacting with student features.
Analyses were also conducted for low- and high-achieving student groups.
Figure 3 presents the top 10 variables with the highest contribution for both groups.
Figure 3 shows that predictive factors for low-achieving students relate more closely to home environment and digital access opportunities: internet access, online collaboration with classmates, access to shared computing facilities, and language spoken at home. In contrast, predictive factors for high-achieving students include parental education level, attitudes toward mathematics achievement, and home educational resources (see
Appendix B). Therefore, low-achieving students’ performance is more strongly explained by basic digital access, while high-achieving students’ performance is more strongly explained by parental education and cultural capital indicators. These results align with findings in the TIMSS literature [
43,
48,
49].
5. Conclusions and Discussion
This study presents a methodological framework for using heterogeneous graph neural networks in modeling multilevel educational data. Hierarchical structures where students are linked to schools and schools to countries have long been a fundamental analytical challenge in educational research. While traditional approaches address this structure, they confine school and country effects to predetermined parametric forms. The graph-based approach proposed in this study aims to overcome this limitation by incorporating school and country contexts into the model as learnable node representations. TIMSS 2023 eighth-grade mathematics achievement data was used to assess the validity of the methodology. A tripartite graph structure consisting of 10,000 students, 789 schools, and 25 countries was created. Four different heterogeneous GNN architectures (GCN, GAT, GIN, GraphSAGE) were evaluated, and optimal hyperparameters were determined for each using Bayesian optimization. The results demonstrate that the heterogeneous GNN architectures exhibit distinct performance characteristics. Even without Bayesian optimization, the GraphSAGE model outperformed all baseline methods, performing competitively with the strongest baseline model (XGBoost: R
2 = 0.5786 vs. GraphSAGE: R
2 = 0.5798). This robustness stems from GraphSAGE’s sampling-based aggregation mechanism, which naturally fits hierarchical data. The GAT and GIN models also significantly outperformed linear methods. In contrast, the GCN model initially performed poorly but surpassed all baseline methods following Bayesian optimization. These findings emphasize the importance of both hyperparameter and architecture selection. Following Bayesian optimization, GraphSAGE_Bayesian achieved the highest explained variance (R
2 = 0.6187), while GCN_Bayesian achieved marginally lower RMSE (71.38 vs. 71.72) and MAE (64.21 vs. 64.32). These differences are practically negligible on the TIMSS scale, and both models substantially outperformed baseline methods. The optimal number of layers in all GNN models is 2, which is consistent with the topological properties of the tripartite graph structure. Two message-passing layers provide sufficient information flow for student nodes to access contextual information at school and country levels. GNNExplainer analyses further clarify the modeling behavior of the graph structure. Individual characteristics such as parental education level, number of books at home, and access to digital resources emerged as strong predictors, as expected. Since GNNExplainer was applied only to student nodes, the explanations primarily reflect student-level features. However, the model’s predictions also vary across students with similar individual characteristics, which indicates that contextual information from school and country nodes is incorporated indirectly through GraphSAGE’s message-passing mechanism, even though this influence is not isolated explicitly by the explainer. The fact that digital access and basic resources are more significant for low-achieving students, and parental education and cultural capital are more significant for high-achieving students, offers important insights into the design of targeted interventions for different student groups. The original contribution of this study lies in the systematic application of heterogeneous graph neural networks to multi-level educational data and the comparative evaluation of different GNN architectures’ performance characteristics. While traditional multi-level modeling approaches have been successful in representing hierarchical educational data within parametric structures, they constrain school and country effects through predetermined functional forms [
50,
51]. This study empirically demonstrates how graph-based representations can overcome this limitation. Notably, GraphSAGE’s strong performance even without hyperparameter optimization constitutes an original finding that reveals this architecture’s natural compatibility with hierarchical educational data. The sampling-based aggregation mechanism proposed by Hamilton et al. [
12] demonstrated more robust performance in tripartite educational data compared to spectral convolution [
9] or attention mechanisms [
10], emphasizing the importance of architectural design alignment with data structure.
The necessity of jointly considering architecture selection and hyperparameter optimization provides an important methodological insight for the effective use of graph-based models in educational research. The Bayesian optimization process [
17], identifying different optimal hyperparameters for each architecture, demonstrates that GNN models require dataset-specific tuning. A notable implication of these results is that heterogeneous GNNs do not automatically outperform strong tabular baselines on multi-level educational data. In our study, the initial GraphSAGE model performed statistically equivalently to XGBoost (R
2 ≈ 0.58), which reflects the robustness of tree-based ensemble methods and the well-known sensitivity of GNNs to architectural choices such as embedding size, number of layers, and aggregation strategy. However, once hyperparameters were systematically optimized through Bayesian optimization, the GraphSAGE model achieved substantially higher explained variance (R
2 = 0.6187), revealing modeling capacity that remained latent under default settings. This demonstrates that the advantage of graph-based representations is not inherent but emerges only when hyperparameters are aligned with the structural properties of the tripartite educational graph. Importantly, the generalizability of empirical findings should be interpreted with care. The conclusions drawn from this study—such as the suitability of two message-passing layers—reflect the specific structural properties of the TIMSS graph, which is sparse, low-diameter, and strictly hierarchical. These results are most directly transferable to other multi-level or k-partite systems that exhibit comparable topological characteristics. Broader applications to domains with substantially different graph structures (e.g., dense, highly cyclic, or non-hierarchical networks) would require separate empirical validation rather than direct extrapolation. In this respect, our methodological framework is general, but the numerical findings are context dependent. Accordingly, our contribution lies not only in applying heterogeneous GNNs to educational data, but also in providing a systematic optimization framework that enables these models to fully exploit multi-level relational structures.
Furthermore, the direct relationship between the optimal number of layers and the student-to-country distance in the tripartite structure (distance = 2 → optimal layers = 2) empirically validates the link between graph topology and deep learning architecture. This result is consistent with the over-smoothing behavior observed in deeper GNN models, which became apparent after two layers in our structure [
45]. The GNNExplainer analysis further illustrates how contextual information from school and country nodes is integrated into student representations—even though explanations were computed only for student nodes—demonstrating that the model captures multi-level dependencies implicitly through message passing.
An important limitation concerns the scope of generalizability of our empirical findings. While the Bayesian optimization framework introduced in this study is methodologically applicable to a broad range of heterogeneous or multi-level graph settings, the specific architectural results—such as the optimality of a two-layer message-passing architecture—are inherently tied to the structural characteristics of the TIMSS tripartite graph (diameter ≈ 2, sparse hierarchical connectivity). These findings may reasonably extend to other domains that exhibit comparable layered multi-level structures (e.g., student–school–country; employee–department–company; patient–clinic–hospital). However, they should not be directly extrapolated to domains whose topologies differ fundamentally, such as dense biological interaction networks or scale-free protein systems, where optimal depth and aggregation behavior follow different structural dynamics.
Taken together, these results show that heterogeneous GNNs offer meaningful advantages over traditional multi-level modeling approaches from both methodological and practical perspectives. For education policymakers and school administrators, this framework provides a valuable tool for early identification of at-risk student groups, as well as a deeper understanding of contextual effects and the design of targeted interventions. GraphSAGE’s robustness also indicates that effective models can be constructed even under limited computational resources, which is particularly advantageous in applied educational settings.
Future work should extend this framework to heterogeneous graphs with diverse topological properties, including bipartite, k-partite, and dynamic graph structures. Systematically evaluating how structural features—such as graph diameter, degree distribution, and clustering coefficient—affect optimal hyperparameters and model performance would help define the broader applicability and limitations of heterogeneous GNNs in complex relational systems.