1. Introduction
Predicting learner performance is a long-standing research problem in educational data mining and is currently one of the main application scenarios for interpretability in this field [
1]. To develop personalized intervention strategies, schools and educational institutions must gain an in-depth understanding of learners’ academic performance and behavioral characteristics. However, due to differences in individual needs and interests, learners’ engagement styles and learning paths are highly diverse. While these diverse learning behaviors provide a wealth of data for online education, they also present significant challenges in terms of analyzing the learning process and predicting student performance [
2,
3,
4]. Accurate analysis of student performance can help instructors identify those at risk of failing or dropping out, provide timely feedback and intervention, and develop personalized learning paths and resources. It also enables learners to monitor their own progress, adjust their learning strategies, and enhance their motivation and self-regulation skills.
Both domestic and international researchers have conducted extensive and in-depth studies on the predictive analysis of learning outcomes, aiming to uncover underlying patterns in learners’ academic performance through educational data mining techniques. In the field of traditional machine learning, for example, Riestra et al. [
5] developed a learning outcome prediction model based on five algorithms: logistic regression, support vector machines, naïve Bayes, decision trees, and multi-layer perceptrons. By analyzing a large volume of LMS log data, they achieved early and accurate predictions of learners’ academic performance. Of the two models designed by Alshabandar et al. [
6], Random Forests performed best in terms of regular academic performance, while Gradient Boosting Machines performed best in terms of final academic achievement. Lu et al. [
7] proposed a model based on dynamic time warping and developed the Dynamic Time Warping with Black Widow Optimization (DTBW) and the Dynamic Time Warping with Fox Optimization (DTFO) models, which achieved accuracy rates of 93.7% and 93.5%, respectively, on public datasets. The introduction of deep learning technologies has further driven the development of this field. Wang et al. [
8] proposed a hybrid model that combines the Long Short-Term Memory (LSTM) with DistilBERT, achieving an accuracy rate of 98.7% whilst optimizing computational efficiency. Meanwhile, Junejo et al. [
9] designed a deep learning model for predicting academic performance tailored to learning scenarios during the pandemic. This model outperformed existing methods on multiple metrics, demonstrating excellent predictive performance. Li Mengying et al. [
10] improved the accuracy of personalized predictions by identifying key features using a dual-path attention mechanism.
Although significant advances have been made using the aforementioned methods, there are still several limitations. Firstly, many models fail to leverage the structural information and attribute feature spaces inherent in learner interactions during feature extraction. This means they overlook the temporal dynamics and long-term dependencies of learning behavior [
11]. Secondly, existing studies often treat learners as independent entities, failing to explore the complex relationships between them in sufficient depth.
In real-world online learning environments, educational data is predominantly sourced from Learning Management System (LMS) logs. However, as smart education and the ‘Smart City’ concept develop, educational settings are gradually adopting multi-source data collection methods based on the Internet of Things (IoT). For example, smart classroom equipment and learning behavior sensors can be used to track learners’ behavioral patterns and interaction modes more comprehensively, thereby providing richer datasets for modeling learning states. However, introducing multi-source, heterogeneous data increases the complexity of data structures and modeling. Graph neural networks (GNNs) are powerful tools for modeling structured data and provide novel solutions for learning performance prediction. GNNs can fully leverage node features and graph structures by recursively aggregating information from neighboring nodes to perform advanced node classification tasks [
12]. This allows them to capture the complex relationships between learners and content. In recent years, the use of GNNs in education has continued to grow. For example, Wang et al. [
13] constructed a heterogeneous graph that integrated text content and knowledge point associations. They achieved a more comprehensive problem representation through a heterogeneous graph neural network (HGNN). This demonstrated the effectiveness of multidimensional structural integration for educational data modeling. Fang et al. [
14] constructed a multi-level knowledge graph based on multi-source data, such as student academic performance and course selection history. They used GCN to mine latent associations, laying a structural basis for personalized recommendations. Fan et al. [
15] employed an enhanced GCN to learn heterogeneous graph embeddings. They combined this with multi-task learning. This improved the accuracy of MOOC recommendations. This further validated the value of multi-topological/heterogeneous graph structures in educational scenarios. However, current GNN models exhibit significant limitations. Traditional GCNs rely on a single topological structure, which makes it difficult to capture multidimensional learning interactions comprehensively. Multimodal Graph Attention Network for Recommendation (MGAT) incorporates an attention mechanism but does not fully account for the complementarity of different topological structures. It also overlooks the temporal dynamics of learning behavior [
16]. Dual Graph Ensemble Learning Method for Knowledge Tracing (DGEKT), proposed by Cui et al. [
17], adopts a dual-graph structure; nonetheless, its core focuses on modeling knowledge associations. This makes it difficult to adapt to the requirements of integrating multi-perspective behavioral relationships in learning progress prediction. Similarly, although heterogeneous GNN models, such as Graph-based Knowledge Tracing for Performance Recommendation (GKTPR, proposed by Zhang et al. [
18]), consider multi-entity associations, they focus on the joint task of knowledge tracing and path recommendation. None of these models has been optimized for feature weight allocation and multi-graph fusion in learning progress prediction. Furthermore, the GCN-SynDCL method put forward by Achari et al. [
19] enhances the graph structure by synthesizing minority nodes to address the common issue of class imbalance in node classification. Yet, it does not address the core requirement of multi-topology fusion and thus fails to overcome the limitations of single-topology representation. In recent years, scholars have begun to focus on modeling the temporal dimension. Xia et al. [
20] proposed a spatiotemporal GNN model that effectively captured the temporal dependencies of learning trajectories on the ASSISTments dataset. Previous research has primarily concentrated on modeling static or stage-based features, frequently overlooking the temporal dynamics of learning behavior and its long-term dependencies. Studies indicate that incorporating time-aware mechanisms can improve modeling capabilities for sequential data and more accurately depict the evolution of learning states. Therefore, future research is expected to improve the model’s dynamic predictive capabilities by further integrating temporal information into learner state transition modeling.
In summary, existing methods face three major challenges. First, it is difficult for a single topological structure to comprehensively characterize the multidimensional interactions among learners. Second, the weighting scheme for features and topological information is not sufficiently refined, which leads to redundant information that interferes with prediction results. Third, structural information and dynamic temporal information have not been effectively combined. To address these challenges, this paper proposes a Multi-Topology Graph Convolutional Network based on an attention mechanism (A-MTGCN). Unlike graph attention networks (GATs), which focus on node-level attention within a single topology, and heterogeneous graph neural networks, which rely on predefined relationship types, the A-MTGCN dynamically integrates multidimensional learner relationships from various graph structures through an adaptive attention mechanism. This design allows the model to capture context-dependent interactions more effectively than static multi-view graph neural network frameworks that employ fixed fusion strategies. In this paper, we use attention mechanisms to differentiate feature weights and achieve complementary structural information through multi-topological fusion. This reduces redundancy and enhances generalization capabilities. The core innovations of this method include (1) using an attention mechanism to adaptively assign feature weights and enhance the contribution of key information; (2) constructing a multidimensional topological graph to capture interaction patterns comprehensively; and (3) integrating features from multiple relationship graphs to reduce structural redundancy, preserve multi-level information, and improve prediction accuracy and model generalization.
4. Experimental Results and Analysis
4.1. Dataset Construction
This study analyzes the learning data of 4400 students from over 300 universities nationwide who participated in massive open online courses (MOOCs) during the 2023–2024 academic year. The data covers courses such as Digital Signal Processing, Digital Image Processing, Circuits, Analog Electronics, Microcontroller Principles, and Interface Technology, and it integrates learning behavior records spanning multiple semesters. A partner university in Shandong Province uniformly collected, anonymized, and standardized all data; the sample is not limited to students from that institution. The data includes five primary categories: learner demographics, learning engagement, learning interactions, daily performance, and academic performance. These categories encompass 14 specific attributes, including gender, student ID, study duration, number of study sessions, number of assignments submitted, and video rumination ratio. See
Table 1 for specific details. This dataset originates from a partner university’s teaching platform and is classified as a non-public educational research dataset. Throughout the research process, strict adherence to privacy protection guidelines was maintained, and sensitive information, such as student ID numbers and gender, was anonymized and encrypted. Therefore, the raw data will not be made publicly available at this time.
The learner demographics category (basic information) includes two attributes, gender and student ID, which describe the learner. Learning engagement includes four attributes (learning duration, learning frequency, number of submitted tasks, and video rumination ratio) and measures how engaged a learner is in online learning. Learning interaction includes three attributes (likes, comments, and replies to comments) and measures a learner’s online interaction. Daily performance includes four attributes (audio score, chapter test score, academic score, and homework score) and evaluates a learner’s day-to-day learning. Academic performance is the overall (final) grade, computed as a weighted combination of the four daily performance scores.
This paper performs preprocessing on the raw data, including cleaning, transformation, and reduction. The main goal is to address issues such as outliers and missing values, making the data more suitable for modeling. This improves the model’s accuracy and reliability. Given the potential presence of noise and redundant information in the raw features during data processing, this paper employs data cleaning and standardization to mitigate the impact of outliers on the model. From the perspective of high-dimensional data modeling, future research could incorporate deep feature learning methods, such as Stacked Autoencoder (SAE), to perform nonlinear dimensionality reduction and representation optimization on learner features. This would further enhance the model’s adaptability to complex, high-dimensional data. After the online learning data is preprocessed, a standardized dataset is obtained for subsequent modeling and prediction. Additionally, to protect learners’ privacy, the data was anonymized. The study strictly adheres to the principle of data minimization by retaining only the features necessary for the learning situation prediction task. Future work could further integrate federated learning with homomorphic encryption mechanisms to enable cross-platform, collaborative data modeling while preventing the leakage of sensitive information. In this study, the learning situation status is the target variable for the classification task. Based on a comprehensive assessment of course grades and learning behaviors, learners are categorized into one of four groups: Excellent, Good, Average, and Passive. These labels are derived from grade ranges and reasonable adjustments using learning behavior features to ensure the validity and discriminative power of the labels.
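The cleaning and standardization step described above can be illustrated with a minimal sketch. The paper does not specify its imputation or outlier rules, so the median imputation, the z-score clipping, the `z_clip` parameter, and the sample values below are all assumptions made for illustration only.

```python
import statistics


def clean_and_standardize(values, z_clip=3.0):
    """Impute missing values with the median, z-score the column, and clip outliers.

    Assumption: median imputation and clipping are illustrative choices,
    not the preprocessing rules used in the paper.
    """
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    filled = [median if v is None else v for v in values]

    mean = statistics.mean(filled)
    std = statistics.pstdev(filled)  # population standard deviation
    if std == 0:
        return [0.0 for _ in filled]

    # Standardize, then limit each z-score to the range [-z_clip, z_clip].
    return [max(-z_clip, min(z_clip, (v - mean) / std)) for v in filled]


# Hypothetical study-duration column with one missing value and one outlier.
study_minutes = [120, 95, None, 110, 2000, 105]
z = clean_and_standardize(study_minutes, z_clip=2.0)
```

With this toy column, the missing entry receives the median of the observed values, and the extreme entry is limited to the clipping bound rather than dominating the scale of the feature.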
4.2. Experimental Environment and Evaluation Indicators
The A-MTGCN model established in this paper is implemented with the PyTorch deep learning framework and PyG (PyTorch Geometric), a library dedicated to processing graph-structured data that can efficiently and flexibly implement various graph neural network models. The specific hardware and software information for the experiments in this section is shown in
Table 2.
To validate the performance of the proposed A-MTGCN model, this paper compares it with support vector machine (SVM), Random Forest (RF), MLP, GCN, Relational Graph Convolutional Network (RGCN), and MGAT models, covering two traditional machine learning algorithms and four deep learning algorithms. The parameters of all machine learning baseline models are optimized using cross-validation, while the parameter settings of the deep learning models are kept consistent with those of the A-MTGCN model. The A-MTGCN algorithm has several adjustable parameters: the number of network layers is tuned over {1, 2, 3, 4}; the dropout ratio is set to 0.5; the similarity threshold for edge construction is tuned over {0.5, 0.6, 0.7, 0.8, 0.9}; and the embedding dimension is tuned over {32, 64, 128, 256}. The Adam optimizer is adopted in the experiments, with the learning rate set to 0.001 and the regularization factor set to 5 × 10⁻³, and the ReLU function is used as the activation function.
In this paper, predicting a learner’s learning situation is defined as a node classification problem, and model performance is evaluated using four metrics: accuracy, precision, recall, and F1-score. Accuracy is the proportion of correctly classified samples among all samples and measures the overall performance of the model. Precision is the proportion of samples predicted as belonging to a particular category that actually belong to that category. Recall, on the other hand, is the proportion of samples that actually belong to a category that the model correctly identifies. F1-score is the harmonic mean of precision and recall and combines the two. Used together, these metrics measure the classification ability of the algorithm from different dimensions and reflect its overall performance in practical application scenarios more accurately.
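As a minimal sketch of how these four metrics can be computed for the four-class task, the following example uses only the Python standard library. The paper does not state how per-class precision, recall, and F1-score are averaged, so the macro-averaging here is an assumption, and the labels in the toy example are hypothetical.

```python
def macro_report(y_true, y_pred):
    """Return accuracy and macro-averaged precision, recall, and F1-score.

    Assumption: macro-averaging (unweighted mean over classes); the paper
    does not specify the averaging scheme.
    """
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)

    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    n = len(labels)
    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Hypothetical labels for six learners.
y_true = ["Excellent", "Good", "Good", "Average", "Passive", "Passive"]
y_pred = ["Excellent", "Good", "Average", "Average", "Passive", "Good"]
report = macro_report(y_true, y_pred)
```

In this toy example four of six learners are classified correctly, and the macro-averaged precision and recall are both 0.75, which shows how accuracy and the class-wise metrics can differ on the same predictions.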
4.3. Experiments and Analysis of Results
4.3.1. Comparative Experiments
To validate the performance of the A-MTGCN model, we conducted comparative experiments on the aforementioned in-house dataset against the baseline models SVM, RF, MLP, GCN, GAT, RGCN, Multiview, and MGAT. To ensure the rigor and validity of the prediction results and to minimize the impact of random errors on the experimental outcomes, this paper conducted an evaluation using 5-fold cross-validation, in addition to the original five replicates. The average of these results was taken as the final outcome. To verify the statistical significance of the performance improvement, we used a paired
t-test to analyze the experimental results. The results of the experiments are shown in
Table 3, where the optimal results are in bold.
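The 5-fold protocol mentioned above can be sketched as follows. This is only an illustration of how the 4400 learner nodes could be partitioned; the shuffling seed, the function name `k_fold_indices`, and the absence of stratification are assumptions, since the paper does not describe how folds were drawn.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of (almost) equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx


# 4400 learners, as in the dataset described in Section 4.1.
splits = list(k_fold_indices(4400, k=5))
```

Each learner appears in exactly one test fold, and the per-fold metric values are then averaged to obtain the reported result.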
As shown in
Table 3, the A-MTGCN model yielded the best results in all metrics, achieving 92.53%, 89.15%, 92.27%, and 87.83% across the four evaluation metrics, respectively. Compared to the second-best model (MGAT), the A-MTGCN model improved by 1.12%, 0.51%, 1.81%, and 0.41%, respectively. This indicates that the A-MTGCN model has a consistent advantage across multiple dimensions. To further validate the effectiveness of the proposed method, a paired Student’s
t-test was conducted based on the results obtained from the 5-fold cross-validation. The statistical analysis shows that A-MTGCN achieves highly significant improvements (
p < 0.01) over GAT in both accuracy and F1-score, with t-values of 4.86 and 5.42, respectively. In comparison with MGAT, A-MTGCN also demonstrates statistically significant improvements (
p < 0.05), with t-values of 3.12 for accuracy and 2.87 for F1-score. Similarly, when compared with the Multiview model, the proposed method achieves significant improvements, with t-values of 3.68 (accuracy) and 3.21 (F1-score), corresponding to
p-values below 0.05. These results confirm that the performance improvements of A-MTGCN are statistically significant and not due to random variations.
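For readers who wish to reproduce this kind of test, the paired t-statistic over per-fold scores can be computed as in the sketch below. The per-fold accuracies are hypothetical placeholders, not the values obtained in this study.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Return the paired t-statistic and its degrees of freedom."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    t_value = mean_diff / (sd_diff / math.sqrt(n))
    return t_value, n - 1


# Hypothetical per-fold accuracies for two models (5 folds each).
model_a = [0.93, 0.92, 0.94, 0.92, 0.93]
model_b = [0.91, 0.91, 0.92, 0.91, 0.92]
t_value, dof = paired_t_statistic(model_a, model_b)
```

With five folds there are four degrees of freedom, for which the two-sided critical values are approximately 2.776 at the 0.05 level and 4.604 at the 0.01 level. The t-values reported above are consistent with these thresholds. A library routine such as `scipy.stats.ttest_rel` returns the same statistic together with the p-value.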
Overall, SVM and RF performed the worst. These two traditional machine learning methods rely solely on manual feature engineering, which makes it difficult to effectively capture the latent relationships among online learners. This limits their ability to represent high-dimensional data, thereby affecting classification performance. Among the deep learning algorithms, MLP outperformed SVM and RF but still lagged behind GCN, RGCN, and MGAT. This is because MLP is a pointwise learning model that cannot leverage adjacency relationships among learners to propagate information, which limits its ability to predict learner performance. GCN and GAT use a neighbor feature aggregation mechanism that enables node embeddings to incorporate information about surrounding learners. This results in significantly higher prediction accuracy than MLP. Additionally, RGCN and MGAT further optimize the information propagation mechanism of GCN. RGCN enhances the ability to learn heterogeneous relationships through relationship modeling. MGAT adaptively adjusts the weights of different neighboring nodes via an attention mechanism to improve classification performance. Nevertheless, these methods primarily aggregate information based on a single or limited topological structure. The Multiview model outperforms GCN by integrating view information under multiple similarity metrics. However, it remains slightly inferior to MGAT, which incorporates an attention mechanism. These results suggest that fusing multi-view information positively impacts node classification tasks, though improvements to the fusion method are still needed. By contrast, A-MTGCN produced the best results across all evaluation metrics. This suggests that integrating multi-topological modeling and attention mechanisms effectively combines structural information from various similarity metrics, which significantly enhances the model’s expressive power.
4.3.2. Parameter Sensitivity Experiment
This paper investigates the sensitivity of the A-MTGCN model to its parameters by conducting a systematic series of experiments and analyses. This study focuses on three key parameters: the number of graph convolutional layers, the node embedding dimension, and the similarity threshold in the edge construction process. These parameters are examined for their impact on model performance. Pillai et al.’s research [
22] indicates that hyperparameter optimization is a key method for mitigating overfitting in deep learning models and enhancing their generalization ability. This research also provides the core theoretical basis for the parameter sensitivity analysis in this paper.
- (1)
Number of layers of the graph convolutional network
In the A-MTGCN model, the number of graph convolution layers directly affects the model’s ability to extract information about the graph structure: each additional layer increases the order of the neighborhood from which a node can aggregate information. The experimental results when the number of GCN layers is set from 1 to 4 are shown in
Figure 9. The results show that the number of GCN layers has a significant effect on model performance. The model performs best with 2 layers, which indicates that it achieves a good balance between the depth of information aggregation and the ability to differentiate features: it effectively captures the higher-order relationships in the graph structure while avoiding the loss of information caused by over-smoothing of features. As the number of layers increases further, the performance of the model decreases significantly on all metrics. This is mainly due to over-smoothing: after too many convolutional layers are stacked, the node representations gradually become similar, which weakens the differences between node features and, in turn, the classification ability of the model.
- (2)
Node Embedding Dimension
The embedding dimension of graph convolutional neural networks determines the complexity of the mapping relationships they can fit. Lower embedding dimensions may not fully express the feature information of nodes, resulting in limited model performance. However, higher embedding dimensions may introduce redundant information, leading to increased computational costs and overfitting of the model. This article experimented with four embedding dimensions, 32, 64, 128, and 256, and the results are shown in
Figure 10. It can be seen that embedding dimensions have a significant impact on model performance. As the dimension increases from 32 to 128, the performance of the model gradually improves. When the dimension is 128, the model reaches its optimal performance, indicating that higher dimensions can enhance feature expression ability and improve classification performance. However, as the dimensionality further increased to 256, various indicators showed a decline due to the introduction of redundant information in high dimensions, which increased the complexity of the model and led to overfitting.
- (3)
Similarity Threshold
In graph construction, the similarity threshold parameter determines the edge connections between nodes, and a reasonable threshold value is crucial for the connectivity of the graph and the final performance of the model. Too low a threshold may lead to too many meaningless edges, which increases the computational complexity; while too high a threshold may lead to too sparse graphs, which may weaken the information dissemination effect. In this paper, experiments are conducted for thresholds of 0.5, 0.6, 0.7, 0.8, and 0.9, and the results are shown in
Figure 11. The model performs best when the threshold is 0.6, which indicates that the graph structure constructed under this setting best retains the effective correlation information between nodes. As the threshold increases further, accuracy, precision, recall, and F1-score all show a decreasing trend, and the model performs worst at 0.9. This is because too high a threshold makes the graph constructed between learner nodes highly sparse, so that a large amount of node and edge information is lost. The missing edges and node interactions limit the depth and breadth of information propagation in the graph convolutional neural network, and the model is therefore unable to learn effective features. In real-world large-scale application scenarios, to enhance the efficiency and flexibility of model deployment, meta-heuristic optimization methods (such as Moth-Flame Optimization and Walrus Optimization) can be used to automatically optimize the similarity threshold parameter A and other key model parameters. This approach supports the rapid identification of well-performing parameter combinations, which helps to keep the model’s predictive performance stable and reliable. Additionally, optimizing parameter configurations indirectly reduces redundant computational overhead, which improves the model’s deployment efficiency, adaptability, and practical value in large-scale scenarios.
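The effect of the similarity threshold on graph sparsity can be illustrated with a small sketch. Cosine similarity is used here as one of the similarity measures considered in this paper; the four feature vectors are hypothetical, and the function names are introduced only for this example.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_edges(features, threshold):
    """Connect two learners when their similarity reaches the threshold."""
    edges = []
    for i in range(len(features)):
        for j in range(i + 1, len(features)):
            if cosine_similarity(features[i], features[j]) >= threshold:
                edges.append((i, j))
    return edges


# Hypothetical normalized behavior features for four learners.
learners = [
    [0.9, 0.8, 0.7],
    [0.8, 0.9, 0.6],
    [0.2, 0.9, 0.1],
    [0.1, 0.2, 0.9],
]
edge_counts = {t: len(build_edges(learners, t)) for t in (0.5, 0.6, 0.7, 0.8, 0.9)}
```

In this toy example the number of edges never increases as the threshold rises, and only one edge remains at 0.9. This mirrors the sparsification effect discussed above: a graph that is too sparse leaves few paths along which information can propagate.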
4.3.3. Model Complexity Experiment
This paper also evaluates the computational cost of the models, measuring the complexity of the models in terms of time consumption and memory usage.
Figure 12 gives the time and memory consumed for a single round of training on the online learning dataset for the A-MTGCN model in this paper, as well as for the MLP, GCN, RGCN, GAT, MGAT, and Multiview models.
Figure 12 shows that there are significant differences among the models in terms of training time and memory consumption. The MLP has the lowest values for both. This is because the MLP does not involve graph structures and performs only fully connected computations. This results in the shortest training time and lowest memory usage. In contrast, since the GCN requires the aggregation of adjacency information, its computational complexity increases, leading to longer training times and higher memory usage. Building on this foundation, GAT introduces a node-level attention mechanism that adaptively weights the importance of neighboring nodes. This endows the model with stronger expressive power. However, this introduces additional computational overhead, resulting in higher resource consumption than GCN. The Multiview model enhances representational capacity by integrating graph structural information from multiple perspectives, but this significantly increases computational complexity. Its time and memory consumption are notably higher than those of single-topology models. This indicates that, while multi-view modeling can improve performance, it also imposes an additional resource burden. MGAT increases computational overhead while improving the model’s expressive power because it requires dynamic weighting of the importance of a learner’s neighboring nodes. RGCN performed the worst because it adds different relationship types to a graph neural network. It learns independent weights for each relationship’s neighboring nodes and propagates information accordingly. In online learning datasets, interactions between learners may involve multiple relationship types. This requires RGCN to model multiple weight matrices, thereby increasing computational and memory requirements. The A-MTGCN model, which is presented in this paper, fuses cross-topological information through multi-topological structures and attention mechanisms. 
Although this approach requires more computing power than the simple graph structure of GCN, it uses significantly less time and memory than MGAT and RGCN. Furthermore, it outperforms multi-view models, such as Multiview, achieving optimal classification performance while maintaining reasonable memory consumption and training time, thus striking a balance between computational efficiency and performance.
In summary, the computational cost of the A-MTGCN method proposed in this paper is clearly lower than that of MGAT and RGCN, although it is slightly higher than that of simpler models such as MLP and GCN. Moreover, the time consumption of A-MTGCN is kept within 1 s per training round and its memory usage is below 1 GB, which makes it practical.
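A per-round measurement of time and memory of the kind reported here can be sketched with the Python standard library. The function `dummy_round` is a hypothetical stand-in for one training round, and `tracemalloc` only tracks memory allocated by the Python interpreter, so this is an illustration of the measurement pattern rather than the procedure used in the experiments.

```python
import time
import tracemalloc


def profile_round(train_one_round, *args):
    """Run one round and return (result, elapsed seconds, peak memory in MiB)."""
    tracemalloc.start()
    start = time.perf_counter()
    result = train_one_round(*args)
    elapsed = time.perf_counter() - start
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed, peak_bytes / 2**20


def dummy_round(n):
    """Hypothetical stand-in for a single training round."""
    return sum(i * i for i in range(n))


result, seconds, peak_mib = profile_round(dummy_round, 100_000)
```

For a PyTorch model trained on a GPU, device memory would need to be read from the framework instead, for example with `torch.cuda.max_memory_allocated`.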
4.3.4. Ablation Experiment
To analyze the effectiveness of each component module in the A-MTGCN model in depth, we used the control variable method and designed multiple model variants for comparative testing. We observed changes in model performance by gradually removing or replacing different components of the model to quantify the contribution of each module to predicting learner performance. Here, A-MTGCN denotes the original prediction model proposed in this paper. This model includes multi-topology construction, a cross-topology information attention fusion mechanism, and a data augmentation module. A-MTGCN_Euc, A-MTGCN_cos, and A-MTGCN_Man are models that adopt a single topological structure based on Euclidean distance, cosine similarity, and Manhattan distance, respectively. These models retain the attention fusion and data augmentation modules. A-MTGCN_Euc+cos, A-MTGCN_cos+Man, and A-MTGCN_Man+Euc are models that adopt a combination of two topological structures. These models also retain the attention fusion mechanism and data augmentation modules. A-MTGCN_no-Att is a model that removes the attention-based (Att) data augmentation component in the learner graph construction phase and fuses cross-topology information using equal weights rather than dynamically adjusted attention weights. A-MTGCN_no-fusion denotes a model that completely removes the cross-topology information fusion module.
As shown in
Table 4, the various constituent modules influence the model’s performance to different degrees. This suggests that each key module of the A-MTGCN model is essential for improving prediction accuracy and model robustness. Under a single-topology condition, the cosine similarity-based model achieved the highest accuracy of 88.69%, indicating that semantic similarity offers a distinct advantage in modeling learning states. Euclidean distance performed relatively stably. However, Manhattan distance exhibited slightly lower performance due to its sensitivity to sparse features. A single topology struggles to capture the complex relationships among learners and only captures information about local similarities. All combinations of two topologies outperform single-topology models. The combination of Euclidean distance and cosine similarity yields the best results with an accuracy rate of 90.57%, demonstrating the strong complementarity between the two. Multi-topology structures capture latent association information from three distinct perspectives: Euclidean distance, cosine similarity, and Manhattan distance. This enables more comprehensive modeling of multi-level relationships among learners, improving the model’s overall performance. Comparing A-MTGCN and A-MTGCN_no-Att reveals that replacing the attention mechanism with equal weights in cross-topology information fusion significantly decreases model performance across all four evaluation metrics. This indicates that the attention mechanism dynamically adjusts the importance of different topologies to achieve superior information fusion. Eliminating the cross-topology information fusion module degraded the model’s performance further, confirming the effectiveness of multi-perspective information fusion. The multi-topology design effectively integrates multidimensional behavioral characteristics, such as assignment completion, classroom participation, and mastery of key concepts, thereby enabling more accurate predictions of academic performance. Different topologies capture behavioral features that complement one another in predictive tasks. Cosine similarity excels at capturing semantic consistency in learning behavior, such as consistent assignment submission patterns. Euclidean distance reflects overall behavioral trends, such as a steady learning progress rate. Manhattan distance is sensitive to local variations, such as fluctuations in test scores. Together, these three metrics provide a comprehensive portrayal of the complex relationships among learners.
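The difference between attention-based fusion and the equal-weight variant can be made concrete with a small sketch. The exact attention formulation of A-MTGCN is not given in this section, so the scoring function below (a dot product between a query vector and the tanh-transformed embedding, followed by a softmax) is an assumed generic form. The `query` vector stands in for a learned parameter, and the per-topology embeddings are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(embeddings, query):
    """Fuse per-topology embeddings of one node with softmax attention weights."""
    scores = [sum(q * math.tanh(z) for q, z in zip(query, emb)) for emb in embeddings]
    weights = softmax(scores)
    dim = len(embeddings[0])
    fused = [sum(w * emb[d] for w, emb in zip(weights, embeddings)) for d in range(dim)]
    return fused, weights


def equal_fuse(embeddings):
    """Equal-weight fusion, analogous to the no-attention ablation variant."""
    k = len(embeddings)
    return [sum(col) / k for col in zip(*embeddings)]


# Hypothetical embeddings of one learner under the three topologies.
node_views = {
    "euclidean": [0.2, 0.1],
    "cosine": [0.9, 0.7],
    "manhattan": [0.1, 0.3],
}
query = [1.0, 1.0]  # hypothetical stand-in for a learned attention vector
fused, weights = attention_fuse(list(node_views.values()), query)
baseline = equal_fuse(list(node_views.values()))
```

Under these assumed values the cosine view receives the largest attention weight, so the fused embedding lies closer to that view than the equal-weight average does. This is the mechanism by which attention can emphasize the more informative topology for a given node.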
In summary, the A-MTGCN model effectively captures multi-level correlation information among learners through the synergistic interaction of its multi-topological structure, attention fusion mechanism, and data augmentation module. This enables high-precision score prediction while demonstrating high transparency and practical value in terms of interpretability.