A Multi-Layer Attention Knowledge Tracking Method with Self-Supervised Noise Tolerance
Abstract
1. Introduction
2. Related Work
3. MASKT Model
3.1. Problem Definition
3.2. Feature Embedding Screening
3.3. MASKT Component
3.4. Self-Supervised Sequence Tasks
3.5. Introduction of the Dynamic Masking Mechanism
- Equivalent data augmentation: By randomly varying the mask pattern during training, dynamic masking effectively expands the diversity of training samples, so each epoch trains on a different masked view of the same data.
- Feature learning smoothing: By preventing the model from depending on fixed contextual positions, dynamic masking promotes robust feature representations, significantly reduces the risk of overfitting, and lowers RMSE in noisy test scenarios.
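The paper does not include the masking code; the idea above can be sketched as resampling mask positions on every call, so no fixed context is ever reliably available to the model. The mask ratio and mask token below are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def dynamic_mask(sequence, mask_ratio=0.15, mask_token=-1):
    """Return a masked copy of a 1-D interaction sequence.

    Mask positions are resampled on every call, so each training
    epoch sees a different masked view of the same sequence.
    """
    seq = np.asarray(sequence).copy()
    n_mask = max(1, int(len(seq) * mask_ratio))
    positions = rng.choice(len(seq), size=n_mask, replace=False)
    seq[positions] = mask_token
    return seq, positions

interactions = np.arange(20)  # toy answer sequence
view_a, pos_a = dynamic_mask(interactions)
view_b, pos_b = dynamic_mask(interactions)
# view_a and view_b generally mask different positions of the same sequence
```

Re-drawing the mask per step (rather than fixing it at preprocessing time, as in static masking) is what yields the augmentation effect described above.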
3.6. Noise Sequence Restoration Task
4. Multi-Layer Cross-Attention Embedding Module
4.1. Node-Dimension Level Attention
4.2. Historical Answer Sequence Attention
5. Fusion Encoder
5.1. Loss Function
5.2. Relevance Index
6. Experiment
6.1. Experimental Setup
6.2. Dataset
6.3. Comparison Experiment
6.4. Performance Evaluation
6.5. Comparison to BART
7. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
Abbreviation | Meaning
---|---
MASKT | Multi-layer Attention Self-Supervised Knowledge Tracking method
BKT | Bayesian Knowledge Tracing model
SAKT | Self-Attentive model for Knowledge Tracing
AKT | Attentive Knowledge Tracing
PCA | Principal Component Analysis
References
- Yudelson, M.V.; Koedinger, K.R.; Gordon, G.J. Individualized bayesian knowledge tracing models. In Proceedings of the Artificial Intelligence in Education: 16th International Conference, AIED 2013, Memphis, TN, USA, 9–13 July 2013; pp. 171–180. [Google Scholar]
- Piech, C.; Bassen, J.; Huang, J.; Ganguli, S.; Sahami, M.; Guibas, L.J.; Sohl-Dickstein, J. Deep knowledge tracing. In Proceedings of the 29th International Conference on Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; pp. 505–513. [Google Scholar]
- Pandey, S.; Karypis, G. A self-attentive model for knowledge tracing. arXiv 2019, arXiv:1907.06837. [Google Scholar] [CrossRef]
- Ghosh, A.; Heffernan, N.; Lan, A.S. Context-aware attentive knowledge tracing. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Virtual Event, CA, USA, 6–10 July 2020; pp. 2330–2339. [Google Scholar]
- Sun, J.; Zhou, J.; Liu, S.; He, F.; Tang, Y. Hierarchical Attention Network Based Interpretable Knowledge Tracing. J. Comput. Res. Dev. 2021, 58, 2630–2644. [Google Scholar] [CrossRef]
- Tian, Z.; Zheng, G.; Flanagan, B. BEKT: Deep knowledge tracing with bidirectional encoder representations from transformers. In Proceedings of the International Conference on Computers in Education, Virtual, 22–26 November 2021. [Google Scholar]
- Lewis, M.; Liu, Y.; Goyal, N.; Ghazvininejad, M.; Mohamed, A.; Levy, O.; Stoyanov, V.; Zettlemoyer, L. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv 2019, arXiv:1910.13461. [Google Scholar]
- Yang, Z.; Dai, Z.; Yang, Y.; Carbonell, J.; Salakhutdinov, R.R.; Le, Q.V. Xlnet: Generalized autoregressive pretraining for language understanding. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; pp. 5753–5763. [Google Scholar]
- Dong, L.; Yang, N.; Wang, W.; Wei, F.; Liu, X.; Wang, Y.; Gao, J.; Zhou, M.; Hon, H.W. Unified language model pre-training for natural language understanding and generation. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; pp. 13063–13075. [Google Scholar]
- Mnih, A.; Hinton, G.E. A scalable hierarchical distributed language model. In Proceedings of the 22nd International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 8–10 December 2008; pp. 1081–1088. [Google Scholar]
- Peters, M.E.; Neumann, M.; Zettlemoyer, L.; Yih, W.T. Dissecting contextual word embeddings: Architecture and representation. arXiv 2018, arXiv:1808.08949. [Google Scholar] [CrossRef]
- Sun, F.; Liu, J.; Wu, J.; Pei, C.; Lin, X.; Ou, W.; Jiang, P. BERT4Rec: Sequential recommendation with bidirectional encoder representations from transformer. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, Beijing, China, 3–7 November 2019; pp. 1441–1450. [Google Scholar]
- Lee, W.; Chun, J.; Lee, Y.; Park, K.; Park, S. Contrastive learning for knowledge tracing. In Proceedings of the ACM Web Conference 2022, Lyon, France, 25–29 April 2022; pp. 2330–2338. [Google Scholar]
- Lin, S.; Tian, H. Short-term metro passenger flow prediction based on random forest and LSTM. In Proceedings of the 2020 IEEE 4th Information Technology, Networking, Electronic and Automation Control Conference (ITNEC), Chongqing, China, 12–14 June 2020; Volume 1, pp. 2520–2526. [Google Scholar]
- Zhu, Y.; Duan, J.; Li, Y.; Wu, T. Image classification method of cashmere and wool based on the multi-feature selection and random forest method. Text. Res. J. 2022, 92, 1012–1025. [Google Scholar] [CrossRef]
- Liu, Z.; Liu, Q.; Chen, J.; Huang, S.; Gao, B.; Luo, W.; Weng, J. Enhancing deep knowledge tracing with auxiliary tasks. In Proceedings of the ACM Web Conference 2023, Austin, TX, USA, 30 April–4 May 2023; pp. 4178–4187. [Google Scholar]
- Shen, S.; Liu, Q.; Chen, E.; Huang, Z.; Huang, W.; Yin, Y.; Su, Y.; Wang, S. Learning process-consistent knowledge tracing. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, Singapore, 14–18 August 2021; pp. 1452–1460. [Google Scholar]
- Yeung, C.K.; Yeung, D.Y. Addressing two problems in deep knowledge tracing via prediction-consistent regularization. In Proceedings of the Fifth Annual ACM Conference on Learning at Scale, London, UK, 26–28 June 2018; pp. 1–10. [Google Scholar]
- Stamper, J.; Pardos, Z.A. The 2010 KDD Cup Competition Dataset: Engaging the machine learning community in predictive learning analytics. J. Learn. Anal. 2016, 3, 312–316. [Google Scholar] [CrossRef]
- Feng, M.; Heffernan, N.; Koedinger, K. Addressing the assessment challenge with an online system that tutors as it assesses. User Model. User-Adapt. Interact. 2009, 19, 243–266. [Google Scholar] [CrossRef]
- Koedinger, K.R.; Baker, R.S.; Cunningham, K.; Skogsholm, A.; Leber, B.; Stamper, J. A data repository for the EDM community: The PSLC DataShop. Handb. Educ. Data Min. 2010, 43, 43–56. [Google Scholar]
- King, D.R. Production implementation of recurrent neural networks in adaptive instructional systems. In Proceedings of the Adaptive Instructional Systems: Second International Conference, AIS 2020, Held as Part of the 22nd HCI International Conference, HCII 2020, Copenhagen, Denmark, 19–24 July 2020; pp. 350–361. [Google Scholar]
- Choi, Y.; Lee, Y.; Shin, D.; Cho, J.; Park, S.; Lee, S.; Baek, J.; Bae, C.; Kim, B.; Heo, J. Ednet: A large-scale hierarchical dataset in education. In Proceedings of the Artificial Intelligence in Education: 21st International Conference, AIED 2020, Ifrane, Morocco, 6–10 July 2020; pp. 69–73. [Google Scholar]
- Wang, T.; Ma, F.; Gao, J. Deep hierarchical knowledge tracing. In Proceedings of the 12th International Conference on Educational Data Mining, Montréal, QC, Canada, 2–5 July 2019. [Google Scholar]
Behavioral Characteristic | Meaning
---|---
order_id | Student ID
problem_id | Problem ID
original | Whether the student’s original answer was marked correct or incorrect by the teacher
correct | Whether the student’s answer is correct (1 for correct, 0 for incorrect)
attempt_count | Number of practice attempts recorded when the student answered the question
ms_first_response | Time from the start of the question to the student’s first action (in milliseconds)
tutor_mode | Tutor mode
answer_type | Answer type
sequence_id | Sequence ID
student_class_id | Student class ID
problem_set_type | Problem set type
base_sequence_id | Base sequence ID
skill_id | ID of the skill involved in the question
skill_name | Name of the skill involved in the question
teacher_id | Teacher ID
school_id | School ID
hint_count | Number of hints used during answering
hint_total | Total number of hints available during answering
overlap_time | Time to complete the question (in milliseconds)
answer_id | Answer ID
answer_text | Answer text
Feature Name | Positive Correlation | Negative Correlation
---|---|---
original (binary problem) | 0.32 | 0.25
attempt_count (total number of attempts) | 0.21 | 0.29
ms_first_response (first response time) | 0.13 | 0.13
hint_count (number of hints) | 0.13 | 0.18
correct (whether correct) | 0.09 | 0.11
answer_type (answer type) | 0.06 | 0.03
tutor_mode (tutor mode) | 0.04 | 0.04
Position (position) | 0.04 | 0.01
hint_total (total number of hints) | 0.01 | 0.05
overlap_time (completion time) | 0.01 | 0.09
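Correlation-based screening of this kind can be sketched with pandas: compute the correlation of each behavioral feature with answer correctness and drop features below a threshold. The column names follow the dataset above; the data values and the 0.1 cutoff are illustrative assumptions, not the paper's.

```python
import pandas as pd

# Toy interaction log using a few of the behavioral features above
# (column names follow the dataset; the values are synthetic).
log = pd.DataFrame({
    "attempt_count":     [1, 3, 2, 5, 1, 4],
    "hint_count":        [0, 2, 1, 3, 0, 2],
    "ms_first_response": [1200, 5400, 2100, 8800, 900, 6100],
    "correct":           [1, 0, 1, 0, 1, 0],
})

# Pearson correlation of each feature with answer correctness;
# features whose |r| falls below the threshold would be screened out.
corr = log.drop(columns="correct").corrwith(log["correct"])
selected = corr[corr.abs() >= 0.1].index.tolist()
```

In this toy log all three features correlate (negatively) with correctness and survive the cutoff; on real data the threshold would be tuned against validation performance.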
Dataset | Students | Knowledge Concepts | Interactions
---|---|---|---
Algebra05 [19] | 574 | 436 | 607,026
ASSIST2009 [20] | 4151 | 110 | 325,673
ASSIST2012 [21] | 27,485 | 265 | 53,065
ASSIST2015 [22] | 19,840 | 100 | 683,801
EdNet [23] | 784,309 | 13,169 | 131,317,236
For each baseline, cells report the t-value and p-value of the significance test; the MASKT row reports AUC as mean ± std.

Method | Algebra05 | ASSIST09 | ASSIST12 | ASSIST15 | EdNet
---|---|---|---|---|---
DKT | 7.38 (0.0020) | 1.54 (0.1986) | 1.23 (0.2877) | 1.50 (0.2079) | 6.18 (0.0035)
DKT+ | 7.41 (0.0017) | 2.17 (0.0962) | 0.90 (0.4203) | 0.84 (0.4475) | 3.90 (0.0176)
DKVMN | 5.23 (0.0063) | 1.82 (0.1446) | 1.47 (0.2149) | 1.69 (0.1666) | 6.65 (0.0027)
SAKT | 5.06 (0.0072) | 2.21 (0.0918) | 1.16 (0.3105) | 1.99 (0.1176) | 5.49 (0.0054)
SKVMN | 5.25 (0.0063) | 2.90 (0.0441) | 0.19 (0.8591) | 0.72 (0.5125) | 4.37 (0.0120)
SAINT | 4.47 (0.0111) | 1.59 (0.1879) | 0.28 (0.7951) | 0.22 (0.8377) | 2.69 (0.0545)
IRT | 8.91 (0.0009) | 3.44 (0.0260) | 3.58 (0.0183) | 2.68 (0.0551) | 5.06 (0.0072)
AKT | 6.22 (0.0034) | 1.48 (0.2133) | 0.94 (0.4015) | 0.74 (0.4995) | 0.70 (0.5230)
MSKT | 1.25 (0.2796) | 3.05 (0.0380) | 0.87 (0.4352) | 1.84 (0.1399) | 1.17 (0.3071)
MASKT | 0.8103 ± 0.0098 | 0.7794 ± 0.0124 | 0.7620 ± 0.0168 | 0.7453 ± 0.0159 | 0.7714 ± 0.0152
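The paired values in the baseline rows above are consistent with t-statistics and p-values from a per-fold paired significance test; such a comparison can be sketched with scipy. The per-fold AUC scores below are illustrative assumptions, not results from the paper.

```python
from scipy import stats

# Hypothetical per-fold AUC scores for MASKT and one baseline over
# five cross-validation folds (values are illustrative, not the paper's).
auc_maskt    = [0.812, 0.809, 0.811, 0.808, 0.8115]
auc_baseline = [0.801, 0.798, 0.803, 0.797, 0.8000]

# Paired t-test over folds: a large positive t with a small p would
# indicate that MASKT's advantage is statistically significant.
t_stat, p_value = stats.ttest_rel(auc_maskt, auc_baseline)
```

A paired test is appropriate here because both models are evaluated on the same folds, so per-fold scores are matched rather than independent samples.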
For each baseline, cells report the t-value and p-value of the significance test; cells given as mean ± std report RMSE.

Method | Algebra05 | ASSIST09 | ASSIST12 | ASSIST15 | EdNet
---|---|---|---|---|---
DKT | 2.51 (0.0665) | 0.32 (0.7660) | 2.41 (0.0740) | 0.81 (0.4640) | 1.83 (0.1409)
DKT+ | 2.16 (0.0976) | 0.02 (0.9860) | 1.67 (0.1702) | 0.47 (0.6635) | 1.88 (0.1339)
DKVMN | 1.67 (0.1700) | 0.18 (0.8674) | 1.51 (0.2064) | 0.98 (0.3840) | 0.94 (0.3990)
SAKT | 1.65 (0.1749) | 0.31 (0.7711) | 1.33 (0.2552) | 0.13 (0.9040) | 0.4165 ± 0.0162, −1.51 (0.6370)
SKVMN | 1.70 (0.1647) | 0.73 (0.5053) | 1.64 (0.1768) | 0.62 (0.5680) | 0.48 (0.6560)
SAINT | 1.47 (0.2156) | 0.05 (0.9658) | 0.41 (0.7032) | 1.04 (0.3570) | −1.24 (0.8240)
IRT | 3.13 (0.0354) | 2.64 (0.0578) | 2.00 (0.1161) | 2.05 (0.1100) | 3.89 (0.0178)
AKT | 1.78 (0.1493) | 0.4324 ± 0.0149, −1.17 (0.8750) | 1.69 (0.1673) | 0.27 (0.8030) | −1.49 (0.6490)
MSKT | 0.58 (0.5934) | 0.32 (0.7665) | 2.70 (0.0540) | 0.75 (0.4970) | 0.76 (0.4880)
MASKT | 0.3865 ± 0.0083 | 0.3994 ± 0.0092 | 0.4185 ± 0.0113 | |
Method | Algebra05 | ASSIST09 | ASSIST12 | ASSIST15 | EdNet |
---|---|---|---|---|---|
MASKT-F | 0.8047 | 0.7751 | 0.7341 | 0.7451 | 0.7654 |
MASKT-D | 0.7992 | 0.7796 | 0.7473 | 0.7337 | 0.7631 |
MASKT-R | 0.7949 | 0.7841 | 0.7482 | 0.7401 | 0.7573 |
MASKT | 0.8103 | 0.7944 | 0.7620 | 0.7553 | 0.7714 |
Method | Algebra05 | ASSIST09 | ASSIST12 | ASSIST15 | EdNet |
---|---|---|---|---|---|
MASKT-D | 0.7992 | 0.7796 | 0.7473 | 0.7337 | 0.7631 |
MD-Dynamic | 0.8095 | 0.7953 | 0.7579 | 0.7426 | 0.7694 |
MASKT | 0.8103 | 0.7944 | 0.7620 | 0.7553 | 0.7714 |
MASKT-DY | 0.8196 | 0.8109 | 0.7673 | 0.7518 | 0.7786 |
Method | Algebra05 | ASSIST09 | ASSIST12 | ASSIST15 | EdNet |
---|---|---|---|---|---|
BART | 0.8071 | 0.7959 | 0.7678 | 0.7486 | 0.7747 |
MASKT-MA | 0.8047 | 0.7944 | 0.7620 | 0.7518 | 0.7714 |
MASKT | 0.8071 | 0.7959 | 0.7678 | 0.7486 | 0.7747 |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Wang, H.; Liu, H.; Ge, Y.; Yu, Z. A Multi-Layer Attention Knowledge Tracking Method with Self-Supervised Noise Tolerance. Appl. Sci. 2025, 15, 8717. https://doi.org/10.3390/app15158717