Article

Robust Deep Knowledge Tracing with Out-of-Distribution Detection

1
School of Computer Science, Northwestern Polytechnical University, Xi’an 710129, China
2
Big Data Storage and Management MIIT Lab, Xi’an 710129, China
*
Author to whom correspondence should be addressed.
AI Educ. 2026, 2(1), 6; https://doi.org/10.3390/aieduc2010006
Submission received: 12 January 2026 / Revised: 11 February 2026 / Accepted: 20 February 2026 / Published: 9 March 2026

Abstract

Modeling the temporal dynamics of student learning is a central goal in educational data mining. Deep Knowledge Tracing (DKT) has emerged as a key approach, yet existing models are highly sensitive to out-of-distribution (OOD) inputs, such as those arising from curriculum changes, new assessment formats, or behavioral noise, which severely degrade predictive reliability. To address this challenge, we propose Energy-Based Out-of-Distribution Deep Knowledge Tracing (EB-OOD DKT), a unified framework that integrates energy-based uncertainty estimation and contrastive representation learning within a transformer-based DKT architecture. The model computes energy scores via the negative log-sum-exponential of prediction logits, serving as confidence indicators for detecting OOD inputs during inference. Additionally, an InfoNCE-based contrastive loss enhances representation robustness by aligning in-distribution samples and separating OOD cases in latent space. Temporal and behavioral context features, such as normalized response intervals and cumulative attempt counts, are incorporated to enrich cognitive-behavioral modeling. Experiments on four public educational datasets demonstrate consistent improvements in prediction accuracy and OOD detection. EB-OOD DKT provides a promising approach for more reliable student modeling across educational platforms with different content distributions.

1. Introduction

DKT aims to track a learner’s evolving mastery of concepts by modeling sequential responses with recurrent neural networks, thereby predicting future performance (Dai et al., 2022; Z. Liu et al., 2025; Pandey & Karypis, 2019). These precise predictions support adaptive learning systems that deliver targeted interventions to enhance retention and prevent forgetting. Recent studies have advanced DKT via attention mechanisms, graph embeddings, and multimodal integration, improving generalization and interpretability across educational platforms (Abdelrahman et al., 2023; Baker & Inventado, 2014; D’Mello & Graesser, 2015; Gervet et al., 2020; Shen et al., 2024; Sun et al., 2025; Wu et al., 2019; Y. Zhang et al., 2020; Zhou et al., 2021).
However, DKT models often ignore out-of-distribution data. A representative model, Self-Attentive Knowledge Tracing (SAKT), leverages self-attention mechanisms capable of capturing long-range dependencies in student responses (Pandey & Karypis, 2019). Despite such achievements, a basic drawback of current DKT approaches is that they assume a student’s activities conform to the distribution of the training data, making them highly vulnerable to OOD inputs. Such inputs arise from a range of sources, including curriculum drift, new question types, platform changes, and irregular student actions (Pavlik et al., 2009), all of which produce OOD sequences. Moreover, baseline DKT models frequently make highly confident predictions on abnormal input sequences, undermining their reliability in adaptive learning settings.
To address this OOD problem, this study proposes an energy-based framework that extends DKT. In this model, an energy-based score W. Liu et al. (2020) serves as a measure of prediction confidence: a high energy score indicates high uncertainty and signals an out-of-distribution input. An energy-based OOD detection strategy makes it possible to flag OOD sequences immediately and adjust the learning system early, maintaining the integrity of student models (Pandey & Karypis, 2019).
As illustrated in Figure 1, student interactions are encoded into dense representations and processed by a Transformer encoder, whose output feeds a prediction head for next-step correctness and an energy head for OOD detection. Model training uses binary cross-entropy and contrastive InfoNCE losses; in the figure, solid arrows denote inference and dashed arrows indicate training-only signals. The contributions of this work can be summarized as follows:
  • A robust deep knowledge tracing method that integrates energy-based OOD detection into a transformer-based DKT model, allowing DKT to preserve predictive accuracy while handling distributional shifts.
  • A more effective loss that combines DKT’s binary cross-entropy with a contrastive InfoNCE loss, enhancing the representation learning of student capability and mitigating the effects of distributional shift in the data.
  • A comprehensive evaluation showing superior OOD detection and predictive accuracy compared with baseline models, providing a path toward integrating OOD detection into DKT models.

2. Related Work

Knowledge Tracing (KT) has evolved over time from interpretable probabilistic models to powerful but hard-to-interpret deep learning architectures (Shen et al., 2024; Y. Zhang et al., 2023b). Following the introduction, this section reviews current research in two principal areas: Knowledge Tracing and its extensions, and OOD detection. All symbols used are listed in Table 1.

2.1. Deep Knowledge Tracing

The development of DKT (Piech et al., 2015; Scarlatos et al., 2025) marked an important milestone in KT. DKT used recurrent neural networks (RNNs) with long short-term memory (LSTM) units to learn patterns in student interactions over time. Specifically, the hidden state updates at each time step t as follows:
h_t = tanh(W_{hx} x_t + W_{hh} h_{t-1} + b_h)
where h_t denotes the hidden state at time step t. The probability of answering the next question correctly is computed as:
p_t = σ(W_{hy} h_t + b_y)
where σ is the sigmoid function. This allowed DKT to model long-range temporal dependencies, yielding major improvements over BKT and IRT (Embretson & Reise, 2013; Reckase, 2009). However, DKT also introduced several issues, such as overfitting and poor interpretability Nagatani et al. (2019); Yeung and Yeung (2018). Subsequent work addressed these limitations through architectural modifications L. Zhang et al. (2017); W. Zhang et al. (2022), regularization techniques Y. Liu et al. (2020); Vie and Kashima (2019), and interpretable attention mechanisms Ghosh et al. (2020).
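As a rough illustration of the DKT recurrence above, the following sketch rolls the hidden-state update and the sigmoid prediction over a toy three-step sequence; all sizes, inputs, and randomly initialized parameters are illustrative, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h = 4, 8  # illustrative input and hidden sizes

# Randomly initialized parameters of the classic DKT recurrence (toy values).
W_hx = rng.normal(scale=0.1, size=(d_h, d_in))
W_hh = rng.normal(scale=0.1, size=(d_h, d_h))
b_h = np.zeros(d_h)
W_hy = rng.normal(scale=0.1, size=(1, d_h))
b_y = np.zeros(1)

def dkt_step(x_t, h_prev):
    """One DKT step: h_t = tanh(W_hx x_t + W_hh h_{t-1} + b_h), p_t = sigmoid(W_hy h_t + b_y)."""
    h_t = np.tanh(W_hx @ x_t + W_hh @ h_prev + b_h)
    p_t = 1.0 / (1.0 + np.exp(-(W_hy @ h_t + b_y)))  # prob. next answer is correct
    return h_t, p_t

h = np.zeros(d_h)
for _ in range(3):            # roll the recurrence over a toy 3-step sequence
    x = rng.normal(size=d_in)
    h, p = dkt_step(x, h)

print(float(p[0]))  # a probability strictly between 0 and 1
```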

2.2. Energy-Based OOD Detection: Background

Energy-Based Models (EBMs) are a powerful tool for OOD detection Mei et al. (2025). In EBMs, lower energy scores correspond to in-distribution data, while higher energy scores indicate OOD data. The energy function W. Liu et al. (2020) is defined as:
E(x) = -T · log Σ_i exp(f_i(x) / T)
where f_i(x) denotes the logits of the model and T is the temperature parameter. Lower energy values indicate high model confidence that input x belongs to the learned distribution, while higher values signal potential out-of-distribution inputs. Alternative OOD detection approaches include maximum softmax probability Hendrycks and Gimpel (2017), Mahalanobis distance-based methods K. Lee et al. (2018), ensemble-based uncertainty estimation Lakshminarayanan et al. (2017), and Bayesian neural networks Gal and Ghahramani (2016). However, energy-based scoring offers computational efficiency and parameter-free deployment advantages compared with these alternatives W. Liu et al. (2020).
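A minimal sketch of this energy score on plain Python floats; the logit values are made up purely to show that peaked (confident) logits yield lower energy than flat (uncertain) ones.

```python
import math

def energy_score(logits, T=1.0):
    """E(x) = -T * log(sum_i exp(f_i(x)/T)); lower energy = more in-distribution."""
    m = max(z / T for z in logits)  # subtract the max for numerical stability
    return -T * (m + math.log(sum(math.exp(z / T - m) for z in logits)))

# Confident (peaked) logits give lower energy than uncertain (flat) logits.
confident = energy_score([8.0, -2.0])
uncertain = energy_score([0.5, 0.4])
print(confident < uncertain)  # True
```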

2.3. Problem

While prior studies have made significant progress on sequential student modeling and OOD detection separately, no current framework effectively integrates the two for educational settings. Traditional KT methods frequently encounter out-of-distribution data, yet no OOD techniques have been integrated into DKT. This gap motivates the EB-OOD DKT framework, which combines energy-based uncertainty estimation with transformer-based KT to produce a reliable DKT model.

3. The Proposed Method

The EB-OOD DKT framework combines energy-based OOD detection Mei et al. (2025) with contrastive representation learning to improve the resilience of DKT models. This approach addresses the issue posed by OOD inputs, i.e., interaction patterns that differ from those encountered during training. Such distribution shifts may arise from curriculum updates, variations in question formats, or atypical student behaviors, and can significantly diminish model accuracy and dependability if not properly managed (Hendrycks & Gimpel, 2017).
As shown in Figure 2, sequences of student interactions are embedded and projected before temporal encoding with a self-attention (Transformer) encoder. The prediction head outputs next-step correctness ŷ_{t+1}, while a parallel energy head computes an OOD confidence score E(x_t) = -log Σ exp(z_t) from the logits z_t. A threshold τ on E(x_t) distinguishes in-distribution from out-of-distribution sequences, providing real-time reliability assessment alongside performance prediction. Training is multi-objective: binary cross-entropy optimizes ŷ_{t+1}, and a contrastive InfoNCE loss shapes the representation such that ID samples yield lower energy and OOD samples higher energy. This joint-head design produces calibrated predictions with explicit uncertainty signals, supporting robust knowledge tracing under domain shift.

3.1. Input Representation

In the context of KT, student interaction is modeled using a set of key features that capture the student’s knowledge state. These key features can include Skill ID, Response Correctness, Time Interval, and Cumulative Attempts, enabling a comprehensive student knowledge state to be generated (Piech et al., 2015).
Skill ID s_t represents the learning concept being evaluated at time t. As a categorical variable, it requires an embedding transformation into a continuous vector space:
e_{s_t} = Embedding(s_t)
where e_{s_t} ∈ R^d is the embedding vector for Skill ID s_t and d is the embedding dimension. This embedding layer is learned during training via backpropagation, allowing the model to capture relationships between different skills by adjusting the embeddings based on prediction error (van den Oord et al., 2019).
Response Correctness r t indicates whether the student’s response at time t was correct (1) or incorrect (0). It is incorporated in this existing model in its raw format in order to give immediate feedback on students’ performance (Piech et al., 2015).
Time Interval (Δt_t) is a continuous feature that measures the time elapsed between the current event and the immediately preceding one. The time interval is normalized using Min-Max scaling so that all intervals share a common scale. This normalization allows the model to interpret temporal learning phenomena adequately (van den Oord et al., 2019).
Cumulative Attempts (a_t) is a continuous variable counting the total number of attempts a student has made on the skill up to time t. The feature is log-transformed and then normalized to address skewness in the distribution of attempts, ensuring that the model is not biased by a small number of skills with unusually high attempt counts (Piech et al., 2015).
The complete input feature vector e_t at time t is then composed as the sum of the individual features: the embeddings of Skill ID (s_t) and Response Correctness (r_t), plus the projections of the continuous features Time Interval (Δt_t) and Cumulative Attempts (a_t):
e_t = Embed(s_t) + Embed(r_t) + Proj(Δt_t) + Proj(a_t)
This sequence of feature vectors serves as the model input, which is used to predict student performance on future tasks as well as to identify OOD examples (Piech et al., 2015).
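The feature-composition step above can be sketched in PyTorch as follows; the embedding size, skill count, and toy input values are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d = 16                 # illustrative embedding size (the paper tunes this)
n_skills = 100         # illustrative skill vocabulary size

skill_emb = nn.Embedding(n_skills, d)   # Embed(s_t)
resp_emb = nn.Embedding(2, d)           # Embed(r_t), r_t in {0, 1}
dt_proj = nn.Linear(1, d)               # Proj(Δt_t) for the normalized interval
att_proj = nn.Linear(1, d)              # Proj(a_t) for log-scaled attempts

def interaction_embedding(s_t, r_t, dt_t, a_t):
    """e_t = Embed(s_t) + Embed(r_t) + Proj(Δt_t) + Proj(a_t)."""
    return (skill_emb(s_t) + resp_emb(r_t)
            + dt_proj(dt_t.unsqueeze(-1)) + att_proj(a_t.unsqueeze(-1)))

# One toy interaction: skill 7 answered correctly after a normalized gap of 0.2,
# with log1p-transformed prior attempts (values are made up for illustration).
e_t = interaction_embedding(torch.tensor(7), torch.tensor(1),
                            torch.tensor(0.2), torch.log1p(torch.tensor(3.0)))
print(e_t.shape)  # torch.Size([16])
```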

3.2. Sequence Modeling with Transformer

A Transformer encoder (Vaswani et al., 2017) models the sequence of embeddings e_1, e_2, …, e_t, leveraging self-attention to capture both local and long-range dependencies in student learning trajectories. Passing the sequence through multiple attention layers yields hidden states h_1, h_2, …, h_T, where each h_t ∈ R^d represents the contextualized knowledge state at time t. The probability of correctness is then predicted by
r̂_{t+1} = σ(W_h h_t + b)
The model is trained with the binary cross-entropy (BCE) loss:
L_BCE = -Σ_{t=1}^{N} [ r_{t+1} log(r̂_{t+1}) + (1 - r_{t+1}) log(1 - r̂_{t+1}) ]
where r_{t+1} is the true outcome and N is the total number of time steps.
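A minimal PyTorch sketch of the sequence model and BCE objective described above. The tiny encoder, its hyperparameters, and the toy labels are illustrative assumptions; a production model would also apply a causal attention mask so each step attends only to earlier interactions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d, T_len = 16, 5  # illustrative model width and sequence length

# A small Transformer encoder standing in for the paper's sequence model.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True),
    num_layers=2,
)
head = nn.Linear(d, 1)  # prediction head: r_hat = sigmoid(W_h h_t + b)

e = torch.randn(1, T_len, d)   # embedded interactions e_1..e_T (toy values)
h = encoder(e)                 # contextualized knowledge states h_1..h_T
r_hat = torch.sigmoid(head(h)).squeeze(-1)

# Binary cross-entropy against the observed next-step correctness labels.
r_true = torch.tensor([[1.0, 0.0, 1.0, 1.0, 0.0]])
loss = nn.functional.binary_cross_entropy(r_hat, r_true)
print(r_hat.shape, float(loss) > 0)
```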

3.3. Energy-Based OOD Detection

During training, energy scores guide threshold calibration and provide implicit supervision via the contrastive loss (Section 3.4). At inference, only the energy score E(x) and threshold τ are computed; no contrastive pairs are needed. We integrate energy-based OOD detection into DKT using the model logits, i.e., the raw unnormalized scores from the final layer before activation. For binary classification (in-distribution vs. out-of-distribution), the model outputs two logits, z_1 and z_2.
To compute the energy score E(x), which quantifies prediction confidence, we apply the negative log-sum-exponential (NLSE) to these logits:
E(x) = -T · log Σ_{i=1}^{C} exp(z_i / T)
where T is the temperature parameter controlling the smoothness of the energy landscape. A lower T sharpens the ID/OOD separation but increases false positives; a higher T yields more conservative estimates but may miss subtle shifts. We set T via grid search on validation data (Section 4.6). The energy score is then standardized using training-set statistics:
Ê(x) = (E(x) - μ_E) / σ_E
where μ_E and σ_E are the mean and standard deviation of the energy values computed over the training set. An input sequence is classified as OOD if its normalized energy exceeds a threshold τ (W. Liu et al., 2020).
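A small sketch of the normalization and threshold rule; the training-set energies, the incoming energy, and the threshold τ are all made-up values chosen to make the OOD decision visible.

```python
import statistics

def normalize_energy(e, mu, sigma):
    """E_hat(x) = (E(x) - mu_E) / sigma_E, using training-set statistics."""
    return (e - mu) / sigma

# Energies of a toy in-distribution training set and one incoming sequence.
train_energies = [-8.1, -7.9, -8.3, -8.0, -7.8]   # made-up ID energies
mu = statistics.mean(train_energies)
sigma = statistics.stdev(train_energies)

tau = 3.0          # illustrative threshold on the normalized energy
incoming = -1.2    # an unusually high (less negative) energy
is_ood = normalize_energy(incoming, mu, sigma) > tau
print(is_ood)  # True
```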

3.4. Contrastive Loss Function

With the aim of enhancing the model’s effectiveness regarding OOD detection and Knowledge Tracing in the EB-OOD DKT framework, contrastive representation learning (CRL) was incorporated in our method. CRL benefits the model in being able to distinguish between in-distribution and OOD sequences more effectively in the learned latent space of student interactions (Bengio et al., 2013; van den Oord et al., 2019).
The InfoNCE loss function Wang et al. (2025) is used to optimize the vector representations produced by the Transformer encoder (van den Oord et al., 2019), aiming to bring in-distribution (ID) interactions closer together in the embedding space while pushing OOD interactions apart. The function is expressed as:
L_contrastive = -log [ exp(sim(q, k⁺)/τ) / ( exp(sim(q, k⁺)/τ) + Σ_j exp(sim(q, k_j)/τ) ) ]
where q is the query (anchor interaction), k⁺ is the positive key (a similar interaction), and the k_j are the negative keys (dissimilar interactions). Similarity is typically computed as cosine similarity, and τ is a temperature parameter that adjusts the sharpness of the distribution.
In KT, contrastive learning adjusts student interaction embeddings to maximize similarity between positive pairs and minimize similarity between negative pairs W. Lee et al. (2022). A positive pair consists of two successive interactions on the same skill, since these represent closely related knowledge states. Negative pairs are randomly selected pairs of interactions on different skills, which are likely to be dissimilar; the InfoNCE loss pushes the embeddings of such pairs apart.
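The InfoNCE objective can be sketched as follows; the hand-picked 2-D vectors simply make the pull/push behavior visible and are not taken from the paper.

```python
import torch
import torch.nn.functional as F

def info_nce(q, k_pos, k_negs, tau=0.1):
    """InfoNCE over cosine similarities: pull q toward k_pos, push it from k_negs."""
    q = F.normalize(q, dim=-1)
    k_pos = F.normalize(k_pos, dim=-1)
    k_negs = F.normalize(k_negs, dim=-1)
    pos = torch.exp((q * k_pos).sum(-1) / tau)        # similarity to the positive key
    neg = torch.exp(k_negs @ q / tau).sum()           # similarities to negative keys
    return -torch.log(pos / (pos + neg))

q = torch.tensor([1.0, 0.0])                          # anchor interaction embedding
k_negs = torch.tensor([[0.0, 1.0], [-1.0, 0.0]])      # dissimilar (negative) keys

loss_close = info_nce(q, torch.tensor([1.0, 0.0]), k_negs)  # positive aligned with q
loss_far = info_nce(q, torch.tensor([0.0, 1.0]), k_negs)    # positive far from q
print(float(loss_close), float(loss_far))  # the aligned positive gives a lower loss
```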
This two-fold methodology thus maximizes both the classification (student response prediction) and OOD identification tasks, making the model more robust. The model is also able to better distinguish between similar and dissimilar behaviors among students due to the increased accuracy of student interaction embeddings that improve its generalization ability among students and different interaction behaviors.

3.5. The Proposed Algorithm

In summary, as shown in Figure 2, the proposed algorithm proceeds as follows: each input x_t is encoded into an embedding e_t, passed through the Transformer encoder to obtain logits z_t, and the energy score E(x_t) determines the OOD decision. Algorithm 1 details the full training and inference procedure of the EB-OOD DKT framework. During training, student interactions are encoded into embeddings, a Transformer encoder captures temporal dependencies, and the model predicts next-step correctness; a joint binary cross-entropy and contrastive loss keeps performance prediction accurate while strengthening the latent representations. Adam optimizes the joint loss, and validation-set energy scores guide the choice of the OOD threshold τ and early stopping. At inference, a single forward pass yields both the predicted probability of student correctness r̂_{t+1} and an OOD label. This dual output lets the framework make accurate predictions while flagging potentially out-of-distribution interactions, making it robust across educational settings.
Algorithm 1 EB-OOD DKT (Batch Training and Inference)
1: // ===== TRAINING PHASE =====
2: Initialization: parameters θ, learning rate η, temperature T, contrastive weight λ
3: Data: training set Q, validation set V
4: for each epoch = 1 … E do
5:   for each minibatch B ⊂ Q do
6:     Encode each interaction x_t ∈ B to embedding e_t (Equation (4))
7:     Obtain hidden states h_t = Transformer(e_t)
8:     Predict next response: r̂_{t+1} = σ(W h_t + b)
9:     Compute the binary cross-entropy loss L_BCE with the true labels
10:    Form contrastive pairs within the batch and compute L_contrastive using Equation (9)
11:    Compute joint loss: L_total = L_BCE + λ · L_contrastive
12:    Update parameters: θ ← θ − η · ∇_θ L_total (via the Adam optimizer)
13:  end for
14:  Compute energy scores on V: E(h_t) = −T · log Σ_{y ∈ {0,1}} exp(f_θ(h_t, y) / T)
15:  Validate performance (e.g., AUROC) and select threshold τ
16:  Apply early stopping based on the validation metric
17: end for
18: Return: trained model and OOD threshold τ
19: // ===== INFERENCE PHASE (Single Forward Pass) =====
20: Input: interaction x_t, trained model θ, temperature T, threshold τ
21: Encode x_t to e_t, then to h_t via the Transformer encoder
22: Predict response: r̂_{t+1} = σ(W h_t + b)
23: Compute energy score: E(h_t) = −T · log Σ_y exp(f_θ(h_t, y) / T)
24: Assign OOD label: OOD(x_t) = 1{E(h_t) > τ}
25: Output: r̂_{t+1}, OOD(x_t)
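Assuming the prediction head emits a single correctness logit and the energy head emits two class logits (as in the binary ID/OOD setup described earlier), the inference phase of the algorithm might be sketched as follows; all numbers (logits, statistics, weight λ, temperature T, threshold τ) are illustrative.

```python
import torch

lam, T, tau = 0.5, 1.0, 3.0   # illustrative contrastive weight, temperature, threshold

def joint_loss(l_bce, l_contrastive):
    """Training objective: L_total = L_BCE + lambda * L_contrastive."""
    return l_bce + lam * l_contrastive

def infer(r_logit, ood_logits, mu_e, sigma_e):
    """Single forward pass output: (correctness probability, OOD flag)."""
    r_hat = torch.sigmoid(r_logit)                            # predicted correctness
    energy = -T * torch.logsumexp(ood_logits / T, dim=-1)     # energy score
    ood = (energy - mu_e) / sigma_e > tau                     # normalized threshold rule
    return r_hat, ood

r_hat, ood = infer(torch.tensor(0.8),
                   torch.tensor([0.1, -0.2]),  # toy energy-head logits
                   mu_e=-0.9, sigma_e=0.05)    # toy training-set energy statistics
print(float(r_hat), bool(ood))
```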

4. Experimental Results

4.1. Datasets

For evaluation, we used one in-distribution (ID) dataset and four OOD datasets to investigate the model’s performance in OOD settings, described as follows.
EdNet-KT1 (ID)1: The KT1 subset of EdNet Choi et al. (2020) serves as the in-distribution baseline. It includes over 1 million interactions from more than 5000 students spanning multiple K-12 subjects. The dataset includes over 1000 unique skill IDs, binary response correctness (1 for correct, 0 for incorrect), continuous and normalized time intervals, and log-transformed, normalized cumulative attempts. EdNet-KT1 originates from an AI-powered tutoring platform serving South Korean students (grades 7–12) in mathematics and English. A skill represents a fine-grained learning objective (e.g., solving linear equations with variables on both sides). We selected EdNet-KT1 as the in-distribution baseline due to its scale (over 1 million interactions for robust training), structured curriculum, standardized learning patterns, and data quality (automated logging with complete timestamps and annotations).
ASSIST2009 (OOD)2: The ASSIST2009 dataset Feng et al. (2009) contains 4151 students and 25,637 interactions, focusing on algebra. It includes 2000 unique Skill IDs, binary response correctness (1 for correct, 0 for incorrect), continuous and normalized time intervals, and log-transformed, normalized cumulative attempts.
ASSIST2015 (OOD): The ASSIST2015 dataset has 3800 students and 94,675 interactions, focusing on multi-step problem-solving and word problems. It includes 3500 unique Skill IDs, binary response correctness, continuous and normalized time intervals, and log-transformed, normalized cumulative attempts.
Algebra 2005–2006 (OOD)3: Released as part of KDD Cup 2006, this dataset contains around 3000 students and 4000 interactions in the Cognitive Tutor environment. It includes approximately 500 unique Skill IDs, binary response correctness, continuous and normalized time intervals, and log-transformed, normalized cumulative attempts.
Khan Academy (OOD)4: Unlike KT1’s single-skill focus, Khan Academy spans a wide range of multi-skill sequences across subjects such as mathematics, science, and humanities. Its non-linear structure leads to varied learning trajectories as students engage with multiple skills simultaneously. While precise counts of Skill IDs and interactions are unavailable, the dataset includes hundreds of unique Skill IDs. Response correctness is binary, time intervals are normalized, and cumulative attempts are log-transformed and normalized.
The key statistics of all datasets are summarized in Table 2.

4.2. Experimental Setup

Data Preparation: Each dataset was standardized to a common format of (user_id, skill_id, correctness, timestamp, prior_attempts) following standard KT protocols (Choi et al., 2020; Pandey & Karypis, 2019). Records with missing values were omitted, timestamps were z-normalized, and prior attempts were log-transformed. In EdNet-KT1, 70% of the data were used for training, 10% for validation, and 20% for testing, ensuring no overlap between splits. OOD datasets were used solely for testing to evaluate generalization capability.
Running Platform: The EB-OOD DKT model was implemented in PyTorch (v2.1) and trained on an NVIDIA RTX 3090 GPU (24 GB VRAM). Random seeds were fixed for reproducibility. Xavier uniform initialization and the Adam optimizer (learning rate = 1 × 10 3 , batch size = 64) were employed for 50 epochs, with early stopping based on validation AUC. Negative contrastive pairs were generated by randomly matching each in-distribution (ID) sequence in KT1 with an OOD sequence from ASSIST2015 or Algebra 2005–2006.
Parameter Setting: Hyperparameter tuning explored dropout rates {0.3, 0.5, 0.7}, attention heads {4, 8, 16}, transformer layers {2, 3, 4}, and embedding sizes {256, 512, 768}. The optimal configuration (embedding size 512, 3 layers, 8 heads, dropout 0.5) was selected based on the highest mean validation AUC across five random seeds.
Comparison: For comparative analysis, EB-OOD DKT was evaluated against four established baselines: DKT (Piech et al., 2015): Sequential knowledge tracing with LSTM; SAKT (Pandey & Karypis, 2019): Self-attention-based long-term dependency modeling; DKVMN (J. Zhang et al., 2017): Dynamic key–value memory network; Energy-Based OOD Baseline (W. Liu et al., 2020): Non-contrastive OOD scoring approach. All baseline models were trained on the same EdNet-KT1 dataset using identical preprocessing, splitting strategy, and hyperparameter configuration to ensure methodological consistency.
Evaluation Metric: The EB-OOD DKT framework is comprehensively evaluated using multiple metrics, testing not only predictive accuracy but also reliability across both in-distribution (ID) and OOD settings. Three independent metric groups were applied: (i) predictive capability—Accuracy and Area Under the Curve (AUC), (ii) Detection quality on OOD—Area Under the ROC Curve (AUROC) and False Positive Rate at 95% True Positive Rate (FPR@95), and (iii) reliability of calibration—Expected Calibration Error (ECE).

4.3. Results

4.3.1. In-Distribution Evaluation

This subsection aims to assess the performance of EB-OOD DKT on the in-distribution data, EdNet-KT1. The proposed model is then contrasted with a number of baseline models, including the well-known SAKT model and the DKT model. The metrics used for evaluation include Accuracy, AUC, and F1 Score.
From the results obtained in Table 3, it can be noted that EB-OOD DKT outperforms the baseline models in both Accuracy and AUC. EB-OOD DKT achieves an Accuracy of 0.762, outperforming SAKT by 0.021 points. Additionally, it achieves an AUC of 0.847, surpassing SAKT’s AUC of 0.824. This indicates that the incorporation of OOD detection within the energy-based model contributes to the improvement of the model’s effectiveness in modeling long-term dependencies and accurately predicting students’ performance.

4.3.2. Out-of-Distribution Detection

The OOD results of EB-OOD DKT are tested on four benchmark datasets: ASSIST2015, ASSIST2009, Algebra 2005–2006, and Khan Academy. The metrics used to calculate the accuracy of OOD sequence classification are AUROC and FPR@95.
As presented in Table 4, EB-OOD DKT outperforms the baseline methods on AUROC and FPR@95. The model achieves 0.826 AUROC on ASSIST2015 with a margin of 0.049 against SAKT and achieves 0.133 FPR@95 with a gap of 0.023 against SAKT. These scores emphasize the effectiveness of the EB-OOD DKT model at identifying OOD samples and separating them successfully from the ID patterns despite large domain discrepancies.
Figure 3 shows that EB-OOD DKT consistently attains the highest bars in the AUROC plots and the lowest bars in the FPR@95 plots, making its robustness even clearer. The separation is most noticeable on Algebra 2005–2006, where traditional KT models show a sharp performance drop under domain change. In contrast, EB-OOD DKT preserves separability, indicating that its contrastive learning component effectively regularizes the latent representation space for out-of-distribution generalization (Hendrycks & Gimpel, 2017; W. Liu et al., 2020).

4.4. Embedding Visualization

To provide visual insight into the learned representations, Figure 4 presents a t-SNE visualization of the latent embeddings, highlighting the model’s ability to separate in-distribution (ID) and out-of-distribution (OOD) sequences. The ID samples from EdNet-KT1 (blue points) form a tight, well-defined cluster, indicating that the model learns a cohesive and stable representation of the training distribution. This structural coherence reflects the model’s capacity to capture consistent, domain-specific patterns in student interaction sequences. Conversely, the OOD samples from Khan Academy (red points) are distributed more broadly across the latent space, forming a diffuse and peripheral pattern relative to the ID cluster. This dispersion stems from the inherently multi-skill and heterogeneous nature of Khan Academy’s learning trajectories, where students engage with diverse concepts across subjects, representing a compositional shift not observed in the more structured, single-skill sequences of EdNet-KT1. This pronounced separation is a direct outcome of the contrastive learning objective (InfoNCE loss), which encourages the model to cluster ID examples compactly while repelling OOD examples in the embedding space (W. Liu et al., 2020).
The geometric separation between ID and OOD samples confirms that the model is sensitive to domain shifts. This separation supports the energy-based detection mechanism, as embeddings that are farther from the ID cluster receive higher energy scores, which reflects greater prediction uncertainty (W. Liu et al., 2020). The visualization thus provides empirical support for the model’s ability to maintain predictive accuracy on known data while identifying OOD interactions from platforms such as Khan Academy.

4.5. Calibration and Overall Performance

In this section, we evaluate the calibration performance of EB-OOD DKT using the ECE (Futami & Fujisawa, 2024), which measures how well predicted probabilities align with actual outcomes. Firstly, we examine the energy distributions of ID and OOD sequences, shown in Figure 5. The ID sequences are concentrated in lower-energy regions, while the OOD sequences occupy higher-energy regions, with minimal overlap. This clear separation demonstrates that EB-OOD DKT effectively distinguishes between in-domain and out-of-domain sequences. The energy-based detection mechanism allows the model to robustly identify OOD instances, even under significant domain shifts.
As presented in Figure 6, EB-OOD DKT achieves a significantly lower ECE of 0.064, compared with SAKT’s ECE of 0.103, indicating better alignment between predicted probabilities and actual outcomes. This suggests that EB-OOD DKT is more reliable in its predictions, which is critical for adaptive learning systems where confident decision-making is essential.
Additionally, the threshold sensitivity analysis of τ across multiple datasets is presented in Figure 7. This analysis evaluates the model’s ability to balance False Positive Rate (FPR) and True Positive Rate (TPR) at various energy thresholds. As seen in Figure 7, EB-OOD DKT achieves favorable performance, maintaining a low FPR@95 across datasets such as ASSIST2009, ASSIST2015, Algebra 2005–2006, and Khan Academy. The threshold sensitivity curve illustrates how the model adapts to varying thresholds and effectively distinguishes between ID and OOD sequences.
The combination of improved calibration, clear separation in energy distributions, and robust threshold sensitivity analysis highlights the EB-OOD DKT model’s ability to handle OOD detection and domain shifts effectively while providing reliable predictions. These findings are summarized in Table 5, where performance metrics like AUROC, FPR@95, and ECE are compared across different models.

4.6. Ablation Study

Ablation experiments were performed to analyze the influence of individual components, including the contrastive loss, temporal/behavioral features, energy normalization, and temperature scaling. Each variant used a training setup identical to that of the base model. The AUROC values of the different ablations are shown in Table 6. Removing the contrastive loss or the behavioral embeddings led to significant AUROC drops, underscoring their critical role in distinguishing ID and OOD samples. Energy normalization enhanced calibration, while moderate temperature scaling ensured stable training (W. Liu et al., 2020; van den Oord et al., 2019). These findings confirm the necessity of each component for robust, well-calibrated OOD detection.

4.7. Parameter Sensitivity Analysis

To assess the impact of key hyperparameters on model performance, we conduct a parameter sensitivity analysis. In this analysis, several important hyperparameters are systematically varied, including the embedding dimension, the number of Transformer layers, the number of attention heads, and the dropout rate. Each parameter is adjusted within a predefined range while the remaining parameters are held fixed, and the model’s performance is evaluated on both in-distribution (ID) and OOD datasets using metrics such as AUC and AUROC.
The embedding dimension was varied over {256, 512, 768} to examine how the representation capacity influences the predictive accuracy of the model. The depth of the Transformer encoder was tested by changing the number of layers to {2, 3, 4}, which allowed us to assess its effect on capturing sequential dependencies. The number of attention heads was modified within the set {4, 8, 16} to analyze its impact on the model’s ability to capture complex relationships within the student interaction sequences. Furthermore, the dropout rate was tuned across {0.3, 0.5, 0.7} to find the optimal balance between regularization and learning stability.
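The one-factor-at-a-time sweep described above can be enumerated as follows. The value ranges are those from the text; the base configuration and variable names are assumptions for illustration, and each generated configuration would be passed to the actual training/evaluation routine:

```python
# Value ranges from the sensitivity analysis; the base configuration is a
# hypothetical center point, not one stated in the paper.
SEARCH_SPACE = {
    "embed_dim": [256, 512, 768],
    "num_layers": [2, 3, 4],
    "num_heads": [4, 8, 16],
    "dropout": [0.3, 0.5, 0.7],
}
BASE = {"embed_dim": 512, "num_layers": 3, "num_heads": 8, "dropout": 0.5}

def one_at_a_time(search_space, base):
    """Vary one hyperparameter at a time, holding all others at base values."""
    for name, values in search_space.items():
        for value in values:
            cfg = dict(base)
            cfg[name] = value
            yield name, value, cfg

# Each config would then be trained and scored with AUC (ID) and AUROC (OOD).
configs = list(one_at_a_time(SEARCH_SPACE, BASE))
```

This keeps the sweep to 12 runs (4 parameters × 3 values) rather than the 81 of a full grid, which is why each parameter's effect can be read off independently in Figure 8.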
Figure 8 summarizes the results of this parameter sensitivity analysis, illustrating the model’s performance across different hyperparameter settings. The analysis indicates that the model achieves stable and high performance within a moderate range of hyperparameter values, while extreme configurations yield diminishing returns or signs of overfitting. These findings confirm the robustness of the proposed EB-OOD DKT model and provide guidance for selecting effective hyperparameter configurations in practical applications.

4.8. Energy Score Distribution Analysis

This section evaluates the energy-based separation across all OOD datasets to examine how effectively the EB-OOD DKT framework distinguishes in-distribution (ID) sequences from novel inputs. Figure 9 provides the corresponding visualization.
For each dataset, normalized density histograms were constructed for ID samples (EdNet-KT1) and OOD samples (ASSIST2009, ASSIST2015, Algebra 2005–2006, and Khan Academy). In all cases, ID distributions are concentrated at lower energy values, while OOD distributions shift toward higher energies, confirming a consistent separability pattern that supports the reliability of the learned energy landscape.
The gap between ID and OOD samples, ranging from 0.43 to 0.67, confirms that the proposed energy framework consistently separates the two groups across all datasets (Table 7). The maximum gap of 0.67 is observed on the Algebra 2005–2006 dataset, where the domain difference from EdNet-KT1 is greatest. These results show that EB-OOD DKT assigns lower energy to ID samples and higher energy to OOD samples, supporting reliable detection under distribution changes.
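The energy scores underlying Figure 9 and Table 7 follow the negative log-sum-exponential definition stated in the abstract. A minimal NumPy sketch, where the temperature handling and the gap statistic are assumptions about the exact implementation:

```python
import numpy as np

def energy_score(logits, T=1.0):
    """E(x) = -T * log-sum-exp(logits / T); lower energy indicates ID.

    Numerically stabilized by subtracting the per-row maximum logit.
    """
    z = np.asarray(logits, dtype=float) / T
    m = z.max(axis=-1, keepdims=True)
    return -T * (m.squeeze(-1) + np.log(np.exp(z - m).sum(axis=-1)))

def energy_gap(id_logits, ood_logits, T=1.0):
    """Mean OOD energy minus mean ID energy; a positive gap means separation."""
    return float(energy_score(ood_logits, T).mean() - energy_score(id_logits, T).mean())
```

Confident, high-magnitude logits yield low (more negative) energy, so sequences from a familiar distribution cluster at the low end of the histograms while unfamiliar inputs drift upward.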

5. Discussion

This paper introduced EB-OOD DKT, a technically motivated framework that fortifies the reliability of knowledge tracing by integrating energy-based OOD detection (Hendrycks & Gimpel, 2017; W. Liu et al., 2020) and contrastive representation learning (Choi et al., 2020) into deep knowledge tracing. On educational datasets, EB-OOD DKT demonstrated substantial improvements in both in-distribution (ID) and OOD scenarios. Sensitivity analyses confirmed the model’s stability over a wide hyperparameter range, while ablation studies verified the complementary impact of the contrastive objective and energy-based scoring. The energy-based approach offers practical advantages over alternatives (Hendrycks & Gimpel, 2017; K. Lee et al., 2018): unlike softmax-based confidence scores, energy scores consider the full logit distribution and avoid overconfidence on out-of-distribution inputs; unlike Mahalanobis distance methods (K. Lee et al., 2018), the approach requires no additional parameters and scales efficiently with skill taxonomy size. Recent comparative studies (Sehwag et al., 2021; Yang et al., 2024) have demonstrated that energy-based methods achieve superior OOD detection across diverse domains while maintaining computational efficiency. However, the framework assumes sufficient training data (10K+ interactions) for reliable energy calibration, and the static threshold τ may require periodic recalibration under gradual distributional drift. While effective across four educational platforms, all evaluated datasets use discrete-response formats; applicability to open-ended tasks remains unexplored.
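The contrast with softmax confidence can be made concrete: two logit vectors with identical maximum softmax probability can receive very different energy scores, because energy aggregates the full logit magnitude rather than only the relative class proportions. A small illustrative sketch (not the paper's implementation; the example logit values are assumptions):

```python
import numpy as np

def max_softmax(logits):
    """Maximum softmax probability, the classic confidence baseline."""
    z = np.asarray(logits, dtype=float)
    e = np.exp(z - z.max())
    return float((e / e.sum()).max())

def energy(logits, T=1.0):
    """Energy score: -T * log-sum-exp(logits / T)."""
    z = np.asarray(logits, dtype=float) / T
    return float(-T * (z.max() + np.log(np.exp(z - z.max()).sum())))

# Two hypothetical logit vectors with identical softmax confidence (0.5 each)
# but very different total evidence:
strong_ambiguous = [5.0, 5.0]   # high-magnitude, class-ambiguous evidence
weak_overall = [0.1, 0.1]       # little evidence of any familiar pattern
```

Max-softmax assigns both inputs the same confidence, whereas the energy score still separates them, which is the property the Mahalanobis and softmax baselines in Table 4 lack.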
The framework assumes sufficient training data to establish reliable energy distributions; cold-start scenarios with fewer than 5 interactions per student may exhibit unreliable OOD detection as energy scores cannot distinguish true distribution shifts from sparse-data uncertainty. Extension to cold-start settings remains future work.
Future research should generalize this framework to dynamic OOD detection, where curricula and knowledge structures evolve over time, and develop adaptive calibration mechanisms that preserve probabilistic reliability under non-stationary data conditions (Gama et al., 2014; Lu et al., 2019; Y. Zhang et al., 2023a, 2025). Continual learning approaches (De Lange et al., 2021; Parisi et al., 2019) offer promising directions for maintaining model performance under curriculum drift while preventing catastrophic forgetting of previously learned patterns. Additionally, evaluating EB-OOD DKT on cross-domain learning datasets beyond mathematics, such as scientific reasoning and language acquisition, would further confirm its scalability and generality (Choi et al., 2020).

6. Conclusions

We introduced EB-OOD DKT, a framework integrating energy-based out-of-distribution detection with transformer-based knowledge tracing. The model achieves reliable student performance prediction while identifying distributional shifts from curriculum changes, new assessment formats, and atypical student behaviors. Experiments demonstrate consistent improvements: an accuracy gain over SAKT on in-distribution data (0.762 vs. 0.741), significant AUROC improvements on OOD detection across platforms, and superior calibration (ECE: 0.064 vs. 0.103). The energy-based mechanism provides explicit uncertainty signals alongside predictions, enabling adaptive learning systems to flag unreliable outputs for instructor review. Deployment requires calibrating the energy threshold τ on representative validation data from the target platform and monitoring OOD detection rates to ensure stable operation under curriculum changes.
Several limitations warrant attention. The contrastive learning component increases training time by approximately 30% compared with baseline DKT due to contrastive pair sampling and dual-loss optimization. However, inference complexity remains O(T·d²), identical to baseline SAKT, as energy computation via Equation (7) adds only a log-sum-exp operation over the output logits with negligible latency overhead. The energy threshold τ requires validation data from target domains for optimal calibration. The framework assumes static curriculum structures and does not address continual learning scenarios where the notion of in-distribution evolves over time.
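The threshold calibration step described above can be sketched as follows. The 95% ID-coverage level is an assumed design choice, not a value prescribed by the paper; in deployment it would be tuned on representative validation data from the target platform:

```python
import numpy as np

def calibrate_tau(val_id_energies, id_coverage=0.95):
    """Choose tau so that `id_coverage` of ID validation energies fall below it.

    The coverage level is an assumed design choice that trades false alarms
    on ID data against missed detections on OOD data.
    """
    return float(np.quantile(val_id_energies, id_coverage))

def flag_ood(energies, tau):
    """Flag sequences whose energy exceeds the calibrated threshold."""
    return np.asarray(energies) > tau
```

By construction, roughly 5% of in-distribution traffic is flagged at this setting; monitoring how far the live flag rate drifts from that baseline gives the operational signal for recalibration under curriculum change.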
Future research should explore efficient negative sampling strategies to reduce training overhead, develop domain-agnostic threshold calibration methods, extend evaluation to diverse subject areas beyond mathematics, and investigate adaptive energy landscapes for non-stationary educational environments, enabling systems to recognize the limits of their knowledge and request human oversight when faced with unfamiliar patterns.

Author Contributions

Conceptualization, Y.Z.; methodology, R.H.; software, R.H.; validation, R.H.; formal analysis, R.H.; investigation, R.H. and Y.Z.; resources, Y.Z.; data curation, Y.Z.; writing—original draft preparation, R.H.; writing—review and editing, R.H. and Y.Z.; supervision, Y.Z.; project administration, Y.Z.; funding acquisition, Y.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China under grants 62272392 and 62537001.

Informed Consent Statement

Not applicable.

Data Availability Statement

All data presented in this study are publicly available.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
DKT: Deep Knowledge Tracing
OOD: Out-of-Distribution
AUC: Area Under the Curve
BCE: Binary Cross-Entropy
KT: Knowledge Tracing
SAKT: Self-Attentive Knowledge Tracing
DKVMN: Dynamic Key–Value Memory Network
BKT: Bayesian Knowledge Tracing
IRT: Item Response Theory
LSTM: Long Short-Term Memory
RNN: Recurrent Neural Network
EBM: Energy-Based Model
AUROC: Area Under the Receiver Operating Characteristic Curve
FPR: False Positive Rate
TPR: True Positive Rate
ECE: Expected Calibration Error
CRL: Contrastive Representation Learning
ID: In-Distribution
t-SNE: t-distributed Stochastic Neighbor Embedding
GPU: Graphics Processing Unit
VRAM: Video Random Access Memory
NLSE: Negative Log-Sum-Exponential

Notes

1. https://www.kaggle.com/datasets/gmhost/ednetkt1 (accessed on 15 November 2025).
2. https://sites.google.com/site/assistmentsdata/ (accessed on 1 November 2025).
3. https://pslcdatashop.web.cmu.edu/KDDCup/login (accessed on 1 November 2025).
4.

References

  1. Abdelrahman, G., Wang, Q., & Nunes, B. (2023). Knowledge tracing: A survey. ACM Computing Surveys, 55(11), 1–37. [Google Scholar] [CrossRef]
  2. Baker, R. S., & Inventado, P. S. (2014). Educational data mining and learning analytics. In J. A. Larusson, & B. White (Eds.), Learning analytics: From research to practice (pp. 61–75). Springer. [Google Scholar] [CrossRef]
  3. Bengio, Y., Courville, A., & Vincent, P. (2009). Learning deep architectures for AI. Foundations and Trends® in Machine Learning, 2(1), 1–127. [Google Scholar] [CrossRef]
  4. Choi, Y., Lee, Y., Shin, D., Cho, J., Park, S., Lee, S., Baek, J., Bae, C., Kim, B., & Heo, J. (2020, July 6–10). EdNet: A large-scale hierarchical dataset in education. In I. I. Bittencourt, M. Cukurova, K. Muldner, R. Luckin, & E. Millán (Eds.), Artificial Intelligence in Education: 21st International Conference, AIED 2020, Proceedings, Part II (Vol. 12164, pp. 69–73). Lecture Notes in Artificial Intelligence. Springer. [Google Scholar] [CrossRef]
  5. Corbett, A. T., & Anderson, J. R. (1994). Knowledge tracing: Modeling the acquisition of procedural knowledge. User Modeling and User-Adapted Interaction, 4, 253–278. [Google Scholar] [CrossRef]
  6. Dai, H., Yun, Y., Zhang, Y., Zhang, W., & Shang, X. (2022). Contrastive deep knowledge tracing. In International conference on artificial intelligence in education (pp. 289–292). Springer. [Google Scholar]
  7. De Lange, M., Aljundi, R., Masana, M., Parisot, S., Jia, X., Leonardis, A., Slabaugh, G., & Tuytelaars, T. (2021). A continual learning survey: Defying forgetting in classification tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(7), 3366–3385. [Google Scholar] [CrossRef] [PubMed]
  8. D’Mello, S. K., & Graesser, A. C. (2015). Feeling, thinking, and computing with affect-aware learning technologies. In The oxford handbook of affective computing (pp. 419–434). Oxford Academic. [Google Scholar]
  9. Embretson, S. E., & Reise, S. P. (2000). Item Response Theory for Psychologists. Psychology Press. [Google Scholar]
  10. Feng, M., Heffernan, N., & Koedinger, K. (2009). Addressing the assessment challenge with an online system that tutors as it assesses. User Modeling and User-Adapted Interaction, 19, 243–266. [Google Scholar] [CrossRef]
  11. Futami, F., & Fujisawa, M. (2024). Information-theoretic generalization analysis for expected calibration error. Advances in Neural Information Processing Systems, 37, 84246–84297. [Google Scholar]
  12. Gal, Y., & Ghahramani, Z. (2016). Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In ICML’16: Proceedings of the 33rd international conference on international conference on machine learning—Volume 48 (pp. 1050–1059). JMLR.org. [Google Scholar]
  13. Gama, J., Žliobaitė, I., Bifet, A., Pechenizkiy, M., & Bouchachia, A. (2014). A survey on concept drift adaptation. ACM Computing Surveys, 46(4), 1–37. [Google Scholar] [CrossRef]
  14. Gervet, T., Koedinger, K., Schneider, J., & Mitchell, T. (2020). When is deep learning the best approach to knowledge tracing? Journal of Educational Data Mining, 12(3), 31–54. [Google Scholar]
  15. Ghosh, A., Heffernan, N., & Lan, A. S. (2020). Context-aware attentive knowledge tracing. In Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining (pp. 2330–2339). Association for Computing Machinery. [Google Scholar]
  16. Hendrycks, D., & Gimpel, K. (2017, April 24–26). A baseline for detecting misclassified and out-of-distribution examples in neural networks. International Conference on Learning Representations (ICLR), Toulon, France. [Google Scholar]
  17. Lakshminarayanan, B., Pritzel, A., & Blundell, C. (2017). Simple and scalable predictive uncertainty estimation using deep ensembles. In NIPS’17: Proceedings of the 31st international conference on neural information processing systems (Vol. 30, pp. 6402–6413). Curran Associates Inc. [Google Scholar]
  18. Lee, K., Lee, K., Lee, H., & Shin, J. (2018). A simple unified framework for detecting out-of-distribution samples and adversarial attacks. In NIPS’18: Proceedings of the 32nd international conference on neural information processing systems (Vol. 31, pp. 7167–7177). Curran Associates Inc. [Google Scholar]
  19. Lee, W., Chun, J., Lee, Y., Park, K., & Park, S. (2022). Contrastive learning for knowledge tracing. In Proceedings of the ACM web conference 2022 (pp. 2330–2338). Association for Computing Machinery. [Google Scholar]
  20. Liu, W., Wang, X., Owens, J. D., & Li, Y. (2020). Energy-based out-of-distribution detection. Advances in Neural Information Processing Systems (NeurIPS), 33, 21464–21475. [Google Scholar]
  21. Liu, Y., Yang, Y., Chen, X., Shen, J., Zhang, H., & Yu, Y. (2020, July 7–15). Improving knowledge tracing via pre-training question embeddings. Twenty-Ninth International Joint Conference on Artificial Intelligence (pp. 1577–1583), Yokohama, Japan. [Google Scholar]
  22. Liu, Z., Guo, T., Liang, Q., Hou, M., Zhan, B., Tang, J., Luo, W., & Weng, J. (2025). Deep learning based knowledge tracing: A review, a tool and empirical studies. IEEE Transactions on Knowledge & Data Engineering, 37, 4512–4536. [Google Scholar]
  23. Lu, J., Liu, A., Dong, F., Gu, F., Gama, J., & Zhang, G. (2019). Learning under concept drift: A review. IEEE Transactions on Knowledge and Data Engineering, 31(12), 2346–2363. [Google Scholar] [CrossRef]
  24. Mei, Y., Wang, X., Sun, C., Zhang, D., & Wang, X. (2025). Multi-label out-of-distribution detection with spectral normalized joint energy. World Wide Web, 28(4), 40. [Google Scholar] [CrossRef]
  25. Nagatani, K., Zhang, Q., Sato, M., Chen, Y.-Y., Chen, F., & Ohkuma, T. (2019). Augmenting knowledge tracing by considering forgetting behavior. In WWW ’19: The world wide web conference (pp. 3101–3107). Association for Computing Machinery. [Google Scholar]
  26. Pandey, S. K., & Karypis, G. (2019, July 2–5). A Self-Attentive model for Knowledge Tracing. 12th International Conference on Educational Data Mining (EDM), Montreal, QC, Canada. [Google Scholar]
  27. Parisi, G. I., Kemker, R., Part, J. L., Kanan, C., & Wermter, S. (2019). Continual lifelong learning with neural networks: A review. Neural Networks, 113, 54–71. [Google Scholar] [CrossRef] [PubMed]
  28. Pavlik, P. I., Cen, H., & Koedinger, K. R. (2009, July 6–10). Performance factors analysis—A new alternative to knowledge tracing. 14th International Conference on Artificial Intelligence in Education, Brighton, UK. [Google Scholar]
  29. Piech, C., Spencer, J., Huang, J., Ganguli, S., Sahami, M., Guibas, L., & Sohl-Dickstein, J. (2015). Deep knowledge tracing. Advances in Neural Information Processing Systems (NeurIPS), 28, 1–12. [Google Scholar]
  30. Reckase, M. D. (2009). Multidimensional Item Response Theory. Springer. [Google Scholar]
  31. Scarlatos, A., Baker, R. S., & Lan, A. (2025). Exploring knowledge tracing in tutor-student dialogues using LLMs. In Proceedings of the 15th international learning analytics and knowledge conference (pp. 249–259). Association for Computing Machinery. [Google Scholar]
  32. Sehwag, V., Chiang, M., & Mittal, P. (2021, May 3–7). SSD: A unified framework for self-supervised outlier detection. International Conference on Learning Representations, Virtual. [Google Scholar]
  33. Shen, S., Liu, Q., Huang, Z., Zheng, Y., Yin, M., Wang, M., & Chen, E. (2024). A survey of knowledge tracing: Models, variants, and applications. IEEE Transactions on Learning Technologies, 17, 1858–1879. [Google Scholar] [CrossRef]
  34. Sun, X., Zhang, K., Liu, Q., Shen, S., Wang, F., Guo, Y., & Chen, E. (2025). DASKT: A dynamic affect simulation method for knowledge tracing. IEEE Transactions on Knowledge & Data Engineering, 37(4), 1714–1727. [Google Scholar]
  35. van den Oord, A., Li, Y., & Vinyals, O. (2019). Representation learning with contrastive predictive coding. arXiv, arXiv:1807.03748. [Google Scholar] [CrossRef]
  36. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems (NeurIPS), 30, 5998–6008. [Google Scholar]
  37. Vie, J., & Kashima, H. (2019). Knowledge tracing machines: Factorization machines for knowledge tracing. In Proceedings of the 12th international conference on educational data mining (AAI2019). AAAI Press. [Google Scholar]
  38. Wang, Z., Xu, B., Yuan, Y., Shen, H., & Cheng, X. (2025). InfoNCE is a free lunch for semantically guided graph contrastive learning. In Proceedings of the 48th international ACM SIGIR conference on research and development in information retrieval (pp. 719–728). Association for Computing Machinery. [Google Scholar]
  39. Wu, F., Souza, A., Zhang, T., Fifty, C., Yu, T., & Weinberger, K. Q. (2019). Simplifying graph convolutional networks. In Proceedings of the 36th International Conference on Machine Learning (ICML) (pp. 6861–6871). PMLR. [Google Scholar]
  40. Yang, J., Zhou, K., Li, Y., & Liu, Z. (2024). Generalized out-of-distribution detection: A survey. arXiv, arXiv:2110.11334. [Google Scholar] [CrossRef]
  41. Yeung, C.-K., & Yeung, D.-Y. (2018). Addressing two problems in deep knowledge tracing via prediction-consistent regularization. In Proceedings of the fifth annual ACM conference on learning at scale (pp. 1–10). Association for Computing Machinery. [Google Scholar]
  42. Zhang, J., Shi, X., King, I., & Yeung, D.-Y. (2017). Dynamic key-value memory networks for knowledge tracing. In Proceedings of the 26th international conference on world wide web (WWW) (pp. 765–774). International World Wide Web Conferences Steering Committee. [Google Scholar]
  43. Zhang, L., Xiong, X., Zhao, S., Botelho, A., & Heffernan, N. T. (2017). Incorporating rich features into deep knowledge tracing. In Proceedings of the fourth ACM conference on learning @ scale (pp. 169–172). Association for Computing Machinery. [Google Scholar]
  44. Zhang, W., Zhang, Y., Liu, S., & Shang, X. (2022). Online deep knowledge tracing. In 2022 IEEE International Conference on Data Mining Workshops (ICDMW) (pp. 292–297). IEEE. [Google Scholar] [CrossRef]
  45. Zhang, Y., An, R., Liu, S., Cui, J., & Shang, X. (2023a). Predicting and understanding student learning performance using multi-source sparse attention convolutional neural networks. IEEE Transactions on Big Data, 9(1), 118–132. [Google Scholar] [CrossRef]
  46. Zhang, Y., An, R., Zhang, W., Liu, S., & Shang, X. (2023b). Deep knowledge tracing with concept trees. In International conference on advanced data mining and applications (pp. 377–390). Springer. [Google Scholar]
  47. Zhang, Y., Dai, H., Yun, Y., Liu, S., Lan, A., & Shang, X. (2020). Meta-knowledge dictionary learning on 1-bit response data for student knowledge diagnosis. Knowledge-Based Systems, 205, 106290. [Google Scholar] [CrossRef]
  48. Zhang, Y., Qu, X., Liu, S., Pang, Y., & Shang, X. (2025). Multiscale weisfeiler-leman directed graph neural networks for prerequisite-link prediction. IEEE Transactions on Knowledge and Data Engineering, 37(6), 3556–3569. [Google Scholar] [CrossRef]
  49. Zhou, J., Cui, G., Zhang, Z., Yang, C., Liu, Z., Wang, L., Li, C., & Sun, M. (2021). Graph neural networks: A review of methods and applications. arXiv, arXiv:1812.08434. [Google Scholar] [CrossRef]
Figure 1. The architecture of the proposed EB-OOD DKT model. Student interactions comprising skill (s), attempt (a), time (t), and response (r) are processed by a Transformer Encoder. Orange and blue blocks denote OOD detection and prediction components, respectively. Solid and dashed arrows indicate inference and training-only signals, respectively.
Figure 2. The training and inference workflow of the proposed EB-OOD DKT model. Input features are encoded and processed by a Transformer Encoder; E(x_t) classifies sequences as ID (E(x_t) < τ) or OOD (E(x_t) > τ). Blue circles and green triangles denote ID and OOD samples, respectively. Solid and dashed arrows indicate inference and training-only signals, respectively.
Figure 3. (a) AUROC and (b) FPR@95 comparisons. EB-OOD DKT yields consistently higher AUROC and lower FPR, demonstrating robust OOD detection capability. Source: Authors.
Figure 4. t-SNE visualization of the latent representations.
Figure 5. Energy distributions for ID and OOD datasets. EB-OOD DKT assigns higher energy values to OOD samples, with minimal overlap between ID and OOD regions, confirming the model’s effectiveness at separating in-domain and out-of-domain data.
Figure 6. Calibration curve (Reliability Diagram) for EB-OOD DKT. The model achieves lower ECE (0.064), indicating better alignment between predicted probabilities and actual accuracy.
Figure 7. Threshold sensitivity analysis across datasets. The EB-OOD DKT model shows favorable performance with low FPR@95, indicating effective OOD detection and domain separation across multiple datasets.
Figure 8. Parameter sensitivity analysis of embedding size, Transformer depth, attention heads, and dropout rate on ID and OOD performance.
Figure 9. Energy score distributions across OOD datasets. Normalized density histograms of energy scores for in-distribution (EdNet-KT1, blue) and out-of-distribution datasets (red): (a) ASSIST2009, (b) ASSIST2015, (c) Algebra 2005–2006, and (d) Khan Academy. Across all domains, EB-OOD DKT produces clearly separated distributions, assigning lower energy to known sequences and higher energy to unseen patterns, verifying strong OOD discrimination through its learned energy landscape.
Table 1. Notation and definitions of symbols used in the paper.
Symbol | Definition
x_t | Input at time step t representing the student interaction tuple (s_t, r_t, Δ_t, a_t)
s_t | Skill identifier corresponding to the learning concept attempted at step t
r_t | Binary correctness indicator (1 = correct, 0 = incorrect)
Δ_t | Normalized time interval between successive interactions
a_t | Cumulative number of attempts made by the student on skill s_t
e_t | Embedded vector combining cognitive (skill, response) and behavioral (time, attempt) features
h_t | Transformer encoder hidden state representing contextualized knowledge at time t
W, b | Learnable weights and bias for the prediction layer (used in r_{t+1} = σ(W h_t + b))
r_{t+1} | Predicted probability of correctness for the next interaction
L_BCE | Binary Cross-Entropy loss for correctness prediction
L_contrastive | InfoNCE-based contrastive loss for latent representation separation
L_total | Joint loss combining L_BCE and L_contrastive with weight λ
E(x) | Energy score computed from model logits for OOD detection (negative log-sum-exp)
T | Temperature parameter controlling logit scaling and smoothness
μ_E, σ_E | Mean and standard deviation of energy values on training data for normalization
τ | Energy threshold for OOD classification decision
λ | Hyperparameter balancing contrastive and prediction losses
θ | Model parameters optimized via the Adam optimizer
OOD(x_t) | Out-of-Distribution indicator (1 if OOD, 0 otherwise) for input x_t
Table 2. Summary of datasets used for evaluation. OOD datasets introduce increasing degrees of domain and platform shift relative to EdNet-KT1.
Dataset | Role | Students | Interactions | Primary Domain
EdNet-KT1 | ID | 5000 | >1,000,000 | Multi-Subject (K–12)
ASSIST2009 | OOD | 4151 | 25,637 | Algebra
ASSIST2015 | OOD | 3800 | 94,675 | Algebra (Word, Multi-step)
Algebra 2005–2006 | OOD | 3000 | 4000 | Cognitive Tutor
Khan Academy | OOD | 2800 | 24,000 | Multi-Skill Sequence
Table 3. In-distribution performance on EdNet-KT1. EB-OOD DKT demonstrates superior performance, improving AUC by 0.023 over SAKT and yielding the highest overall accuracy.
Model | Accuracy | AUC
DKT | 0.734 | 0.812
SAKT | 0.741 | 0.824
DKVMN | 0.728 | 0.805
EB-OOD DKT | 0.762 | 0.847
Table 4. OOD Detection Performance (AUROC↑, FPR@95↓) on Multiple Datasets. EB-OOD DKT achieves the best overall separability under domain shifts.
Dataset | AUROC↑ | FPR@95↓
ASSIST2009 | 0.812 | 0.28
ASSIST2015 | 0.826 | 0.26
Algebra 2005–2006 | 0.803 | 0.29
Khan Academy | 0.815 | 0.27
↑ higher is better; ↓ lower is better.
Table 5. Consolidated performance summary. The “Best Baseline” column represents the top-performing model among DKT, SAKT, and DKVMN for each metric.
Metric/Dataset | Best Baseline | EB-OOD DKT (Ours)
Accuracy (EdNet-KT1) | 0.741 (SAKT) | 0.762
AUC (EdNet-KT1) | 0.824 (SAKT) | 0.847
AUROC (ASSIST2015) | 0.743 (SAKT) | 0.826
AUROC (Khan Academy) | 0.718 (SAKT) | 0.815
FPR@95 (Algebra 2005–2006) | 0.42 (SAKT) | 0.29
ECE (Calibration) | 0.088 (SAKT) | 0.064
Table 6. Ablation study results.
Configuration | AUROC
Full Model (EB-OOD DKT) | 0.94
w/o Contrastive Loss | 0.88
w/o Temporal and Behavioral Features | 0.85
w/o Energy Normalization | 0.89
Low Temperature (T = 0.5) | 0.86
High Temperature (T = 2.0) | 0.87
Table 7. Mean energy scores for ID and OOD samples.
Dataset | Mean Energy (ID) | Mean Energy (OOD) | Difference
ASSIST2009 | 1.52 | 2.08 | 0.56
ASSIST2015 | 1.48 | 1.91 | 0.43
Algebra 2005–2006 | 1.45 | 2.12 | 0.67
Khan Academy | 1.50 | 2.05 | 0.55
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Hasan, R.; Zhang, Y. Robust Deep Knowledge Tracing with Out-of-Distribution Detection. AI Educ. 2026, 2, 6. https://doi.org/10.3390/aieduc2010006
