Robust Deep Knowledge Tracing with Out-of-Distribution Detection
Abstract
1. Introduction
- A robust deep knowledge tracing method that integrates energy-based OOD detection into a transformer-based DKT model. Energy-based detection allows the model to preserve predictive accuracy while handling distributional shifts.
- A more effective loss that combines DKT’s binary cross-entropy objective with a contrastive InfoNCE term. This combination strengthens the learned representations of student ability by mitigating issues arising from sparse, noisy interaction data.
- A comprehensive evaluation showing superior OOD detection and predictive accuracy compared with baseline models, providing a practical path for integrating OOD detection into DKT models.
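The joint objective in the second contribution can be sketched in plain Python. This is a minimal illustration, not the paper's implementation: the weight `lam`, the temperature `tau`, and the use of cosine similarity are assumptions, since the outline does not fix them.

```python
import math

def bce_loss(p, y):
    """Binary cross-entropy for one predicted probability p and label y in {0, 1}."""
    eps = 1e-12  # guard against log(0)
    return -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))

def info_nce(anchor, positive, negatives, tau=0.1):
    """InfoNCE over cosine similarities: pull the positive close, push negatives away."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv)
    logits = [cos(anchor, positive) / tau] + [cos(anchor, n) / tau for n in negatives]
    m = max(logits)  # max trick for a numerically stable log-sum-exp
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return -(logits[0] - log_z)  # negative log-softmax of the positive pair

def total_loss(p, y, anchor, positive, negatives, lam=0.5):
    """Joint objective: BCE plus a lambda-weighted InfoNCE term (lam is hypothetical)."""
    return bce_loss(p, y) + lam * info_nce(anchor, positive, negatives)
```

The InfoNCE term is small when the anchor embedding matches its positive and large when a negative matches instead, which is what drives the representation separation described above.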
2. Related Work
2.1. Deep Knowledge Tracing
2.2. Energy-Based OOD Detection: Background
2.3. Problem
3. The Proposed Method
3.1. Input Representation
3.2. Sequence Modeling with Transformer
3.3. Energy-Based OOD Detection
3.4. Contrastive Loss Function
3.5. The Proposed Algorithm
Algorithm 1: EB-OOD DKT (Batch Training and Inference)
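The inference half of the algorithm can be sketched as follows. Only the energy formula follows the standard energy-score definition (Liu et al., 2020); the function names, the normalized-threshold convention, and the choice of logit index 1 as the "correct" class are assumptions for illustration.

```python
import math

def energy_score(logits, T=1.0):
    """Temperature-scaled negative log-sum-exp of the logits (lower = more ID-like)."""
    m = max(z / T for z in logits)  # max trick for numerical stability
    return -T * (m + math.log(sum(math.exp(z / T - m) for z in logits)))

def detect_ood(logits, mu, sigma, tau, T=1.0):
    """Normalize the energy by training-set statistics, then compare to threshold tau."""
    e_norm = (energy_score(logits, T) - mu) / sigma
    return e_norm > tau

def predict_or_abstain(logits, mu, sigma, tau):
    """Gate the correctness prediction on the OOD check: abstain (None) when flagged."""
    if detect_ood(logits, mu, sigma, tau):
        return None  # flagged OOD: abstain rather than predict
    # in-distribution: softmax probability of the assumed 'correct' class (index 1)
    exp = [math.exp(z) for z in logits]
    return exp[1] / sum(exp)
```

Confident (large-magnitude) logits yield low energy and pass the gate; flat or uniformly low logits yield high energy and trigger abstention.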
4. Experimental Results
4.1. Datasets
4.2. Experimental Setup
4.3. Results
4.3.1. In-Distribution Evaluation
4.3.2. Out-of-Distribution Detection
4.4. Embedding Visualization
4.5. Calibration and Overall Performance
4.6. Ablation Study
4.7. Parameter Sensitivity Analysis
4.8. Energy Score Distribution Analysis
5. Discussion
6. Conclusions
Author Contributions
Funding
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
| DKT | Deep Knowledge Tracing |
| OOD | Out-of-Distribution |
| AUC | Area Under the Curve |
| BCE | Binary Cross-Entropy |
| KT | Knowledge Tracing |
| SAKT | Self-Attentive Knowledge Tracing |
| DKVMN | Dynamic Key–Value Memory Network |
| BKT | Bayesian Knowledge Tracing |
| IRT | Item Response Theory |
| LSTM | Long Short-Term Memory |
| RNN | Recurrent Neural Network |
| EBM | Energy-Based Model |
| AUROC | Area Under the Receiver Operating Characteristic Curve |
| FPR | False Positive Rate |
| TPR | True Positive Rate |
| ECE | Expected Calibration Error |
| CRL | Contrastive Representation Learning |
| ID | In-Distribution |
| t-SNE | t-distributed Stochastic Neighbor Embedding |
| GPU | Graphics Processing Unit |
| VRAM | Video Random Access Memory |
| NLSE | Negative Log-Sum-Exponential |
| 1 | https://www.kaggle.com/datasets/gmhost/ednetkt1 (accessed on 15 November 2025). |
| 2 | https://sites.google.com/site/assistmentsdata/ (accessed on 1 November 2025). |
| 3 | https://pslcdatashop.web.cmu.edu/KDDCup/login (accessed on 1 November 2025). |
| 4 | https://www.kaggle.com/datasets/fernandosr85/khan-academy-exercises (accessed on 25 October 2025). |
References
- Abdelrahman, G., Wang, Q., & Nunes, B. (2023). Knowledge tracing: A survey. ACM Computing Surveys, 55(11), 1–37. [Google Scholar] [CrossRef]
- Baker, R. S., & Inventado, P. S. (2014). Educational data mining and learning analytics. In J. A. Larusson, & B. White (Eds.), Learning analytics: From research to practice (pp. 61–75). Springer. [Google Scholar] [CrossRef]
- Bengio, Y., Courville, A., & Vincent, P. (2009). Learning deep architectures for AI. Foundations and Trends® in Machine Learning, 2(1), 1–127. [Google Scholar] [CrossRef]
- Choi, Y., Lee, Y., Shin, D., Cho, J., Park, S., Lee, S., Baek, J., Bae, C., Kim, B., & Heo, J. (2020, July 6–10). EdNet: A large-scale hierarchical dataset in education. In I. I. Bittencourt, M. Cukurova, K. Muldner, R. Luckin, & E. Millán (Eds.), Artificial Intelligence in Education: 21st International Conference, AIED 2020, Proceedings, Part II (Vol. 12164, pp. 69–73). Lecture Notes in Artificial Intelligence. Springer. [Google Scholar] [CrossRef]
- Corbett, A. T., & Anderson, J. R. (1994). Knowledge tracing: Modeling the acquisition of procedural knowledge. User Modeling and User-Adapted Interaction, 4, 253–278. [Google Scholar] [CrossRef]
- Dai, H., Yun, Y., Zhang, Y., Zhang, W., & Shang, X. (2022). Contrastive deep knowledge tracing. In International conference on artificial intelligence in education (pp. 289–292). Springer. [Google Scholar]
- De Lange, M., Aljundi, R., Masana, M., Parisot, S., Jia, X., Leonardis, A., Slabaugh, G., & Tuytelaars, T. (2021). A continual learning survey: Defying forgetting in classification tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(7), 3366–3385. [Google Scholar] [CrossRef] [PubMed]
- D’Mello, S. K., & Graesser, A. C. (2015). Feeling, thinking, and computing with affect-aware learning technologies. In The oxford handbook of affective computing (pp. 419–434). Oxford Academic. [Google Scholar]
- Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Psychology Press. [Google Scholar]
- Feng, M., Heffernan, N., & Koedinger, K. (2009). Addressing the assessment challenge with an online system that tutors as it assesses. User Modeling and User-Adapted Interaction, 19, 243–266. [Google Scholar] [CrossRef]
- Futami, F., & Fujisawa, M. (2024). Information-theoretic generalization analysis for expected calibration error. Advances in Neural Information Processing Systems, 37, 84246–84297. [Google Scholar]
- Gal, Y., & Ghahramani, Z. (2016). Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In ICML’16: Proceedings of the 33rd international conference on international conference on machine learning—Volume 48 (pp. 1050–1059). JMLR.org. [Google Scholar]
- Gama, J., Žliobaitė, I., Bifet, A., Pechenizkiy, M., & Bouchachia, A. (2014). A survey on concept drift adaptation. ACM Computing Surveys, 46(4), 1–37. [Google Scholar] [CrossRef]
- Gervet, T., Koedinger, K., Schneider, J., & Mitchell, T. (2020). When is deep learning the best approach to knowledge tracing? Journal of Educational Data Mining, 12(3), 31–54. [Google Scholar]
- Ghosh, A., Heffernan, N., & Lan, A. S. (2020). Context-aware attentive knowledge tracing. In Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining (pp. 2330–2339). Association for Computing Machinery. [Google Scholar]
- Hendrycks, D., & Gimpel, K. (2017, April 24–26). A baseline for detecting misclassified and out-of-distribution examples in neural networks. International Conference on Learning Representations (ICLR), Toulon, France. [Google Scholar]
- Lakshminarayanan, B., Pritzel, A., & Blundell, C. (2017). Simple and scalable predictive uncertainty estimation using deep ensembles. In NIPS’17: Proceedings of the 31st international conference on neural information processing systems (Vol. 30, pp. 6402–6413). Curran Associates Inc. [Google Scholar]
- Lee, K., Lee, K., Lee, H., & Shin, J. (2018). A simple unified framework for detecting out-of-distribution samples and adversarial attacks. In NIPS’18: Proceedings of the 32nd international conference on neural information processing systems (Vol. 31, pp. 7167–7177). Curran Associates Inc. [Google Scholar]
- Lee, W., Chun, J., Lee, Y., Park, K., & Park, S. (2022). Contrastive learning for knowledge tracing. In Proceedings of the ACM web conference 2022 (pp. 2330–2338). Association for Computing Machinery. [Google Scholar]
- Liu, W., Wang, X., Owens, J. D., & Li, Y. (2020). Energy-based out-of-distribution detection. Advances in Neural Information Processing Systems (NeurIPS), 33, 21464–21475. [Google Scholar]
- Liu, Y., Yang, Y., Chen, X., Shen, J., Zhang, H., & Yu, Y. (2020, July 7–15). Improving knowledge tracing via pre-training question embeddings. Twenty-Ninth International Joint Conference on Artificial Intelligence (pp. 1577–1583), Yokohama, Japan. [Google Scholar]
- Liu, Z., Guo, T., Liang, Q., Hou, M., Zhan, B., Tang, J., Luo, W., & Weng, J. (2025). Deep learning based knowledge tracing: A review, a tool and empirical studies. IEEE Transactions on Knowledge & Data Engineering, 37, 4512–4536. [Google Scholar]
- Lu, J., Liu, A., Dong, F., Gu, F., Gama, J., & Zhang, G. (2019). Learning under concept drift: A review. IEEE Transactions on Knowledge and Data Engineering, 31(12), 2346–2363. [Google Scholar] [CrossRef]
- Mei, Y., Wang, X., Sun, C., Zhang, D., & Wang, X. (2025). Multi-label out-of-distribution detection with spectral normalized joint energy. World Wide Web, 28(4), 40. [Google Scholar] [CrossRef]
- Nagatani, K., Zhang, Q., Sato, M., Chen, Y.-Y., Chen, F., & Ohkuma, T. (2019). Augmenting knowledge tracing by considering forgetting behavior. In WWW ’19: The world wide web conference (pp. 3101–3107). Association for Computing Machinery. [Google Scholar]
- Pandey, S. K., & Karypis, G. (2019, July 2–5). A Self-Attentive model for Knowledge Tracing. 12th International Conference on Educational Data Mining (EDM), Montreal, QC, Canada. [Google Scholar]
- Parisi, G. I., Kemker, R., Part, J. L., Kanan, C., & Wermter, S. (2019). Continual lifelong learning with neural networks: A review. Neural Networks, 113, 54–71. [Google Scholar] [CrossRef] [PubMed]
- Pavlik, P. I., Cen, H., & Koedinger, K. R. (2009, July 6–10). Performance factors analysis—A new alternative to knowledge tracing. 14th International Conference on Artificial Intelligence in Education, Brighton, UK. [Google Scholar]
- Piech, C., Spencer, J., Huang, J., Ganguli, S., Sahami, M., Guibas, L., & Sohl-Dickstein, J. (2015). Deep knowledge tracing. Advances in Neural Information Processing Systems (NeurIPS), 28, 1–12. [Google Scholar]
- Reckase, M. D. (2009). Multidimensional item response theory. Springer. [Google Scholar]
- Scarlatos, A., Baker, R. S., & Lan, A. (2025). Exploring knowledge tracing in tutor-student dialogues using LLMs. In Proceedings of the 15th international learning analytics and knowledge conference (pp. 249–259). Association for Computing Machinery. [Google Scholar]
- Sehwag, V., Chiang, M., & Mittal, P. (2021, May 3–7). SSD: A unified framework for self-supervised outlier detection. International Conference on Learning Representations, Virtual. [Google Scholar]
- Shen, S., Liu, Q., Huang, Z., Zheng, Y., Yin, M., Wang, M., & Chen, E. (2024). A survey of knowledge tracing: Models, variants, and applications. IEEE Transactions on Learning Technologies, 17, 1858–1879. [Google Scholar] [CrossRef]
- Sun, X., Zhang, K., Liu, Q., Shen, S., Wang, F., Guo, Y., & Chen, E. (2025). DASKT: A dynamic affect simulation method for knowledge tracing. IEEE Transactions on Knowledge & Data Engineering, 37(4), 1714–1727. [Google Scholar]
- van den Oord, A., Li, Y., & Vinyals, O. (2019). Representation learning with contrastive predictive coding. arXiv, arXiv:1807.03748. [Google Scholar] [CrossRef]
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems (NeurIPS), 30, 5998–6008. [Google Scholar]
- Vie, J., & Kashima, H. (2019). Knowledge tracing machines: Factorization machines for knowledge tracing. In Proceedings of the 12th international conference on educational data mining (AAI2019). AAAI Press. [Google Scholar]
- Wang, Z., Xu, B., Yuan, Y., Shen, H., & Cheng, X. (2025). InfoNCE is a free lunch for semantically guided graph contrastive learning. In Proceedings of the 48th international ACM SIGIR conference on research and development in information retrieval (pp. 719–728). Association for Computing Machinery. [Google Scholar]
- Wu, F., Souza, A., Zhang, T., Fifty, C., Yu, T., & Weinberger, K. Q. (2019). Simplifying graph convolutional networks. In Proceedings of the 36th International Conference on Machine Learning (ICML) (pp. 6861–6871). PMLR. [Google Scholar]
- Yang, J., Zhou, K., Li, Y., & Liu, Z. (2024). Generalized out-of-distribution detection: A survey. arXiv, arXiv:2110.11334. [Google Scholar] [CrossRef]
- Yeung, C.-K., & Yeung, D.-Y. (2018). Addressing two problems in deep knowledge tracing via prediction-consistent regularization. In Proceedings of the fifth annual ACM conference on learning at scale (pp. 1–10). Association for Computing Machinery. [Google Scholar]
- Zhang, J., Shi, X., King, I., & Yeung, D.-Y. (2017). Dynamic key-value memory networks for knowledge tracing. In Proceedings of the 26th international conference on world wide web (WWW) (pp. 765–774). International World Wide Web Conferences Steering Committee. [Google Scholar]
- Zhang, L., Xiong, X., Zhao, S., Botelho, A., & Heffernan, N. T. (2017). Incorporating rich features into deep knowledge tracing. In Proceedings of the fourth ACM conference on learning @ scale (pp. 169–172). Association for Computing Machinery. [Google Scholar]
- Zhang, W., Zhang, Y., Liu, S., & Shang, X. (2022). Online deep knowledge tracing. In 2022 IEEE International Conference on Data Mining Workshops (ICDMW) (pp. 292–297). IEEE. [Google Scholar] [CrossRef]
- Zhang, Y., An, R., Liu, S., Cui, J., & Shang, X. (2023a). Predicting and understanding student learning performance using multi-source sparse attention convolutional neural networks. IEEE Transactions on Big Data, 9(1), 118–132. [Google Scholar] [CrossRef]
- Zhang, Y., An, R., Zhang, W., Liu, S., & Shang, X. (2023b). Deep knowledge tracing with concept trees. In International conference on advanced data mining and applications (pp. 377–390). Springer. [Google Scholar]
- Zhang, Y., Dai, H., Yun, Y., Liu, S., Lan, A., & Shang, X. (2020). Meta-knowledge dictionary learning on 1-bit response data for student knowledge diagnosis. Knowledge-Based Systems, 205, 106290. [Google Scholar] [CrossRef]
- Zhang, Y., Qu, X., Liu, S., Pang, Y., & Shang, X. (2025). Multiscale weisfeiler-leman directed graph neural networks for prerequisite-link prediction. IEEE Transactions on Knowledge and Data Engineering, 37(6), 3556–3569. [Google Scholar] [CrossRef]
- Zhou, J., Cui, G., Zhang, Z., Yang, C., Liu, Z., Wang, L., Li, C., & Sun, M. (2021). Graph neural networks: A review of methods and applications. arXiv, arXiv:1812.08434. [Google Scholar] [CrossRef]

| Symbol | Definition |
|---|---|
| x_t | Input at time step t representing the student interaction tuple |
| s_t | Skill identifier corresponding to the learning concept attempted at step t |
| a_t | Binary correctness indicator (1 = correct, 0 = incorrect) |
| Δt | Normalized time interval between successive interactions |
| n_t | Cumulative number of attempts made by the student on skill s_t |
| e_t | Embedded vector combining cognitive (skill, response) and behavioral (time, attempt) features |
| h_t | Transformer encoder hidden state representing contextualized knowledge at time t |
| W, b | Learnable weights and bias for the prediction layer |
| p̂_t | Predicted probability of correctness for the next interaction |
| L_BCE | Binary Cross-Entropy loss for correctness prediction |
| L_CL | InfoNCE-based contrastive loss for latent representation separation |
| L_total | Joint loss combining L_BCE and L_CL with weight λ |
| E(x) | Energy score computed from model logits for OOD detection (negative log-sum-exp) |
| T | Temperature parameter controlling logit scaling and smoothness |
| μ, σ | Mean and standard deviation of energy values on training data for normalization |
| τ | Energy threshold for the OOD classification decision |
| λ | Hyperparameter balancing contrastive and prediction losses |
| θ | Model parameters optimized via the Adam optimizer |
| o_t | Out-of-Distribution indicator (1 if OOD, 0 otherwise) for input x_t |
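The detection pipeline implied by these definitions can be written compactly. The notation below is assumed (it follows the standard energy-based OOD formulation of Liu et al., 2020, with z_k denoting the model logits):

```latex
% Energy score: temperature-scaled negative log-sum-exp of the logits z_k
E(x; T) = -T \log \sum_{k} \exp\!\left(\frac{z_k}{T}\right)

% Normalize by training-set statistics, then threshold at \tau
\tilde{E}(x) = \frac{E(x; T) - \mu}{\sigma},
\qquad
o_t = \mathbb{1}\!\left[\,\tilde{E}(x_t) > \tau\,\right]

% Joint training objective with contrastive weight \lambda
\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{BCE}} + \lambda\,\mathcal{L}_{\mathrm{CL}}
```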
| Dataset | Role | Students | Interactions | Primary Domain |
|---|---|---|---|---|
| EdNet-KT1 | ID | 5000 | >1,000,000 | Multi-Subject (K–12) |
| ASSIST2009 | OOD | 4151 | 25,637 | Algebra |
| ASSIST2015 | OOD | 3800 | 94,675 | Algebra (Word, Multi-step) |
| Algebra 2005–2006 | OOD | 3000 | 4000 | Cognitive Tutor |
| Khan Academy | OOD | 2800 | 24,000 | Multi-Skill Sequence |
| Model | Accuracy | AUC |
|---|---|---|
| DKT | 0.734 | 0.812 |
| SAKT | 0.741 | 0.824 |
| DKVMN | 0.728 | 0.805 |
| EB-OOD DKT | 0.762 | 0.847 |
| Dataset | AUROC↑ | FPR@95↓ |
|---|---|---|
| ASSIST2009 | 0.812 | 0.28 |
| ASSIST2015 | 0.826 | 0.26 |
| Algebra 2005–2006 | 0.803 | 0.29 |
| Khan Academy | 0.815 | 0.27 |
↑ higher is better; ↓ lower is better.
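Both detection metrics in this table can be computed directly from per-sample energy scores. The sketch below is a minimal pure-Python version, assuming higher score means more OOD-like (consistent with the energy distributions in Section 4.8); the function names are hypothetical.

```python
def auroc(id_scores, ood_scores):
    """Rank-based AUROC: probability a random OOD sample outranks a random ID sample."""
    wins = 0.0
    for o in ood_scores:
        for i in id_scores:
            if o > i:
                wins += 1.0
            elif o == i:
                wins += 0.5  # ties count half
    return wins / (len(id_scores) * len(ood_scores))

def fpr_at_95_tpr(id_scores, ood_scores):
    """FPR@95: fraction of ID samples above the threshold that detects 95% of OOD."""
    s = sorted(ood_scores)
    thr = s[int(0.05 * len(s))]  # 5th percentile of OOD scores -> TPR >= 95%
    return sum(1 for i in id_scores if i >= thr) / len(id_scores)
```

An AUROC of 1.0 means the energy scores separate the two populations perfectly; 0.5 means they are indistinguishable.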
| Metric/Dataset | Best Baseline | EB-OOD DKT (Ours) |
|---|---|---|
| Accuracy (EdNet-KT1) | 0.741 (SAKT) | 0.762 |
| AUC (EdNet-KT1) | 0.824 (SAKT) | 0.847 |
| AUROC (ASSIST2015) | 0.743 (SAKT) | 0.826 |
| AUROC (Khan Academy) | 0.718 (SAKT) | 0.815 |
| FPR@95 (Algebra 2005–2006) | 0.42 (SAKT) | 0.29 |
| ECE (Calibration) | 0.088 (SAKT) | 0.064 |
| Configuration | AUROC |
|---|---|
| Full Model (EB-OOD DKT) | 0.94 |
| w/o Contrastive Loss | 0.88 |
| w/o Temporal and Behavioral Features | 0.85 |
| w/o Energy Normalization | 0.89 |
| Low Temperature () | 0.86 |
| High Temperature () | 0.87 |
| Dataset | Mean Energy (ID) | Mean Energy (OOD) | Difference |
|---|---|---|---|
| ASSIST2009 | 1.52 | 2.08 | 0.56 |
| ASSIST2015 | 1.48 | 1.91 | 0.43 |
| Algebra 2005–2006 | 1.45 | 2.12 | 0.67 |
| Khan Academy | 1.50 | 2.05 | 0.55 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Hasan, R.; Zhang, Y. Robust Deep Knowledge Tracing with Out-of-Distribution Detection. AI Educ. 2026, 2, 6. https://doi.org/10.3390/aieduc2010006

