Article

ANT-KT: Adaptive NAS Transformers for Knowledge Tracing

College of Information Science & Engineering, Linyi University, Linyi 276012, China
* Authors to whom correspondence should be addressed.
Electronics 2025, 14(21), 4148; https://doi.org/10.3390/electronics14214148
Submission received: 12 September 2025 / Revised: 8 October 2025 / Accepted: 9 October 2025 / Published: 23 October 2025
(This article belongs to the Section Artificial Intelligence)

Abstract

Knowledge Tracing aims to assess students’ mastery of knowledge concepts in real time, playing a crucial role in providing personalized learning services in intelligent tutoring systems. In recent years, researchers have attempted to introduce Neural Architecture Search (NAS) into Knowledge Tracing tasks to automatically design more efficient network structures. However, existing NAS-based methods for Knowledge Tracing suffer from excessively large search spaces and low search efficiency, which significantly constrain their practical application. To address these limitations, this paper proposes an Adaptive Neural Architecture Search framework based on Transformers for KT, called ANT-KT. Specifically, we design an enhanced encoder that combines convolution operations with state vectors to capture both local and global dependencies in students’ learning sequences. Moreover, an optimized decoder with a linear attention mechanism is introduced to improve the efficiency of modeling the long-term evolution of students’ knowledge states. We further propose an evolutionary NAS algorithm that incorporates a model optimization efficiency objective and a dynamic search space reduction strategy, enabling the discovery of high-performing yet computationally efficient architectures. Experimental results on two large-scale real-world datasets, EdNet and RAIEd2020, demonstrate that ANT-KT significantly reduces time costs across all stages of NAS while achieving performance improvements on multiple evaluation metrics, validating the efficiency and practicality of the proposed method.

1. Introduction

Knowledge Tracing (KT) [1] aims to assess a student’s mastery of knowledge concepts in real time and is a key component in realizing personalized adaptive learning [2]. With the rapid development of online education platforms [3,4], the huge volume of learning-record data provides a new opportunity for building Knowledge Tracing models. By modeling students’ exercise-solving sequences, KT can dynamically predict students’ knowledge states, providing an important basis for downstream tasks such as learning path planning and learning material recommendation. In recent years, deep learning has driven substantial advances in KT. Nevertheless, prevailing models still exhibit limitations in capturing students’ forgetting behavior, extracting structural information between exercises and knowledge concepts, and other relevant aspects. Consequently, exploring more effective KT models is imperative for improving the intelligence of adaptive learning systems.
Research on Knowledge Tracing dates back to the 1990s with the introduction of the Bayesian Knowledge Tracing (BKT) model [5]. Subsequently, various probabilistic graphical models were proposed, including dynamic Bayesian networks [6] and hidden Markov models [7]. While these models describe the state transitions of student knowledge using predefined rules, they are limited in representation and generalization. Deep learning has since revolutionized Knowledge Tracing. The Deep Knowledge Tracing (DKT) model [8] pioneered the use of RNNs to model student interactions as sequences. The Dynamic Key-Value Memory Network (DKVMN) [9] further improved on this by introducing memory mechanisms to capture dynamic changes in the knowledge state. The Self-Attentive Knowledge Tracing (SAKT) model of Pandey et al. [10] employs self-attention to model inter-question relationships. The Separated Self-Attentive Neural Knowledge Tracing (SAINT) model [11] marked a further leap by applying Transformers to Knowledge Tracing, achieving substantial performance improvements through richer modeling of exercise interactions. Despite these advances in characterizing learning processes and relationships among knowledge concepts, most methods rely on fixed network structures, overlooking the potential of architecture optimization. Recent work [12] has explored applying Neural Architecture Search (NAS) [13] to Transformer-based Knowledge Tracing models, aiming for automatic discovery of optimal architectures. However, the computational and time demands of NAS methods present challenges for efficient and practical application in Knowledge Tracing.
To address the limitations of applying neural architecture search (NAS) to KT tasks, this paper proposes an Adaptive Neural Architecture Search framework based on Transformers (ANT-KT). Traditional Transformer encoders primarily rely on self-attention mechanisms to model sequence dependencies. However, self-attention has limitations in capturing local features. To overcome this issue, an enhanced encoder is designed that combines convolution operations [14] with state vectors [15], enabling the simultaneous modeling of both local and global dependencies in sequences. This allows the encoder to better capture short-term changes in students’ knowledge states and incremental progress during the learning process. Furthermore, to improve efficiency when handling long sequences, an optimized decoder is proposed that introduces a linear attention mechanism. Unlike traditional decoders that compute self-attention through pairwise comparisons across all sequence positions, the linear attention transforms the attention computation into matrix multiplication via linear mappings. This significantly reduces time complexity while maintaining performance, enhancing the model’s ability to capture long-term dependencies and more accurately predict knowledge state changes over extended learning periods. Moreover, to enhance search efficiency, an improved evolutionary algorithm is introduced. Existing evolutionary algorithms typically optimize model performance in isolation, often ignoring the balance between model performance and computational efficiency [16]. To address this, a model optimization efficiency objective is designed within the evolutionary algorithm, establishing a balanced optimization function considering both performance and computational efficiency. Additionally, a dynamic search space reduction strategy is adopted to adjust the search space during the search process, enabling more efficient discovery of high-performing and computationally efficient architectures. The main contributions of this paper are summarized as follows:
  • Proposes ANT-KT, an adaptive neural architecture search architecture based on Transformer for Knowledge Tracing tasks. ANT-KT integrates an innovative Transformer variant and an improved evolutionary algorithm to enhance model performance and architecture search efficiency.
  • Designs an enhanced encoder–decoder architecture that strengthens the modeling of local and global dependencies. The encoder introduces state vectors combined with convolution operations to capture short-term knowledge state transitions, while the decoder employs a novel linear attention mechanism for efficient fusion of long-term global information, enabling more accurate modeling of students’ learning dynamics over time.
  • Designs an improved evolutionary algorithm to further enhance search efficiency. A model optimization efficiency objective is introduced to establish an objective function that balances performance and computational efficiency. Additionally, a dynamic reduction strategy is adopted to adjust the search space during the search process, enabling more efficient discovery of neural network architectures that are both high-performing and computationally efficient.
  • Conducts extensive experiments on two large-scale real-world datasets, EdNet and RAIEd2020, and compares with multiple existing methods. The results demonstrate that the proposed method achieves improvements across multiple evaluation metrics while also enhancing search efficiency, confirming the framework’s advantages in both performance and efficiency.

2. Related Work

Knowledge Tracing: Knowledge Tracing has evolved from Bayesian parameter estimation [5] to sophisticated deep learning paradigms that capture temporal dynamics and structural dependencies. Early deep approaches such as [8] pioneered RNN/LSTM-based sequential modeling, later extended with temporal features [17,18] and enhanced loss functions [19,20]. Concurrently, structural modeling emerged through graph networks [21,22] encoding concept–question relationships and memory-augmented architectures [9,23] storing long-term dependencies. The advent of Transformer-based models (SAKT [10], AKT [24]) introduced attention-driven interaction weighting, further advanced by hybrid approaches such as [25], which dynamically fuses temporal trajectories with graph structures, and [26], which integrates language models for inductive knowledge inference. More recently, Large Language Models (LLMs) have been explored for Knowledge Tracing tasks [26,27]. For instance, SINKT [26] integrates LLMs for inductive knowledge inference by leveraging pre-trained language understanding, while [27] investigates zero-shot KT using GPT-based models. Although these approaches show promise in handling cold-start scenarios and cross-domain generalization, they face challenges in computational overhead and interpretability. Despite these advances, existing KT models still struggle to balance prediction accuracy and efficiency, highlighting the need for adaptive architectures that can better capture students’ evolving knowledge states.
However, existing work is mainly limited to the original Transformer architecture, ignoring the latest progress in structural optimization.
NAS in Knowledge Tracing: Although deep Knowledge Tracing models have achieved remarkable performance gains in predicting students’ learning states, their architectures are still predominantly designed manually, which makes it difficult to ensure a globally optimal structure. To address this limitation and further improve model performance while reducing the burden of manual design, researchers have increasingly turned to Neural Architecture Search (NAS) [13] techniques in the context of Knowledge Tracing. In particular, Ding et al. pioneered this line of research by introducing the NAS-Cell method [28], which employs reinforcement learning to search for optimal LSTM cell structures. Their approach yielded architectures that surpass manually designed counterparts, leading to significant improvements across multiple Knowledge Tracing tasks. Building on this foundation, Yang et al. further advanced the integration of NAS into Knowledge Tracing by proposing ENAS-KT [12], which not only adopts a more comprehensive and systematic search strategy but also constructs a task-specific search space tailored to the characteristics of Knowledge Tracing. Collectively, these studies highlight both the feasibility and the promising potential of NAS in driving the next stage of development in Knowledge Tracing research.
Moreover, existing NAS-based KT approaches primarily focus on improving predictive accuracy, while neglecting the efficiency of the search process and the resulting model. This imbalance hinders their scalability to large-scale educational datasets and real-time applications. To address these limitations, this paper proposes ANT-KT, an adaptive NAS framework based on Transformer architectures. By introducing an enhanced encoder–decoder structure and an efficiency-aware evolutionary search strategy, ANT-KT achieves a better balance between model performance and computational efficiency, as demonstrated through extensive experiments on large-scale datasets.

3. Methodology

3.1. Problem Definition

Let $S = \{s_1, s_2, \ldots, s_n\}$ be a student’s interaction sequence, where $s_i = (e_i, t_i, r_i)$ for $i = 1, 2, \ldots, n$. Here, $i$ is the index of the interaction step in the sequence, $e_i$ denotes the exercise attempted, $t_i$ indicates the response time, and $r_i \in \{0, 1\}$ represents the response correctness ($r_i = 1$ for correct, $r_i = 0$ for incorrect). Given an interaction sequence $S$, the goal of Knowledge Tracing is to predict the probability $P(r_{n+1} = 1 \mid S, e_{n+1})$ that the next exercise $e_{n+1}$ is answered correctly. Accurately modeling the evolving knowledge state of a student from the historical interaction data $S$ is therefore the key factor determining prediction accuracy.
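As a concrete illustration of this problem setup, the following minimal Python sketch represents an interaction sequence and the next-step prediction target; the data structure and field names are illustrative assumptions, not the authors’ implementation.

```python
# Illustrative sketch of the KT problem definition: a student's interaction
# sequence s_i = (e_i, t_i, r_i) and the next exercise whose correctness
# probability P(r_{n+1} = 1 | S, e_{n+1}) is to be predicted.
from dataclasses import dataclass
from typing import List

@dataclass
class Interaction:
    exercise_id: int     # e_i: the exercise attempted
    elapsed_time: float  # t_i: response time (seconds)
    correct: int         # r_i in {0, 1}

def split_next_step(sequence: List[Interaction]):
    """Split s_1..s_{n+1} into the history S and the next exercise/label."""
    history, target = sequence[:-1], sequence[-1]
    return history, target.exercise_id, target.correct

# Example: three past interactions; a KT model should score the fourth exercise.
seq = [Interaction(12, 8.5, 1), Interaction(7, 21.0, 0),
       Interaction(12, 6.2, 1), Interaction(31, 0.0, 0)]
history, next_exercise, label = split_next_step(seq)
```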

3.2. Proposed Method: ANT-KT Framework

Since the students’ historical interaction datasets are collected based on their mastery of knowledge during the learning process, they inherently possess time-series characteristics. Furthermore, drawing inspiration from the excellent performance of Transformer in language sequence generation tasks within the field of natural language processing, this paper proposes an Adaptive Neural Architecture Search framework based on the Transformer (ANT-KT) for Knowledge Tracing tasks. As shown in Figure 1, the ANT-KT framework consists of three components: (1) a selective hierarchical input module, which is designed to extract and select feature inputs of different granularities and types; (2) an encoder, in which state vectors and convolutional operations are combined to capture both local and global dependencies within the exercise sequence; and (3) a decoder, where the information obtained from the encoder is leveraged to predict students’ knowledge states by introducing linear attention and a selective output mechanism.

3.3. Adaptive Embedding Selection Module

The adaptive embedding selection module is designed to adaptively select the most relevant embedding subset from the candidate embedding set $X_{embed} = \{\text{embed}_i \in \mathbb{R}^{n \times D} \mid 1 \le i \le Num\}$ according to a given student’s interaction sequence $S$, for subsequent processing by the encoder and decoder. Here, $n$ denotes the length of the student’s interaction sequence $S$, $D$ represents the dimension of each embedding, and $Num$ is the total number of candidate embeddings. The candidate embedding set $X_{embed}$ is derived by applying embedding mappings to the discrete features (such as the exercise $e_i$) in the student interaction sequence $S$ and normalizing the continuous features (such as the response time $t_i$).
To achieve adaptive selection, the module introduces two binary vectors $b_{En} \in \{0,1\}^{1 \times Num}$ and $b_{De} \in \{0,1\}^{1 \times Num}$ as the embedding selectors for the encoder and decoder, respectively. Each element $b_i \in \{0,1\}$ indicates whether the $i$-th candidate embedding $\text{embed}_i$ is selected: $b_i = 1$ means the candidate embedding vector $\text{embed}_i$ is selected; otherwise, it is not. To ensure non-empty inputs for the encoder and decoder, the selectors must satisfy the following constraints:
$1 \le \sum_{i=1}^{Num} b_i,\ b_i \in b_{En}; \qquad 1 \le \sum_{i=1}^{Num} b_i,\ b_i \in b_{De}.$
Based on the selectors $b_{En}$ and $b_{De}$, two embedding subsets $X_{En}$ and $X_{De}$ can be obtained from the candidate embedding set $X_{embed}$:
$X_{En} = \{\text{embed}_i \mid \text{embed}_i \in X_{embed},\ b_i = 1,\ b_i \in b_{En}\}, \qquad X_{De} = \{\text{embed}_i \mid \text{embed}_i \in X_{embed},\ b_i = 1,\ b_i \in b_{De}\}.$
These candidate embedding subsets are processed with the hierarchical embedding fusion strategy $\text{HierFuse}(\cdot)$ proposed in [12]. In this strategy, the selected embeddings are first fused pairwise and then fed into a linear transformation that maps the fused features into a tensor of dimension $n \times D$, thereby generating the inputs $H_{En}^{0}$ and $H_{De}^{0}$ for the encoder and decoder:
$H_{En}^{0} = \text{HierFuse}(X_{En}), \qquad H_{De}^{0} = \text{HierFuse}(X_{De}).$
In this module, $\text{HierFuse}(\cdot)$ can be described as follows:
$\text{HierFuse}(X):\ \text{output} = \text{Linear}(\text{Concat}(\text{temp})),$
and
$\text{temp} = \{\,\text{Tanh}(\text{Linear}([x_i, x_j])) \mid x_i, x_j \in X,\ i \ne j\,\},$
where $[x_i, x_j]$ denotes the concatenation of embeddings $x_i$ and $x_j$, $\text{Tanh}(\cdot)$ is the hyperbolic tangent activation function, $\text{Linear}(\cdot)$ denotes a linear transformation, $\text{temp}$ is a temporary variable storing the pairwise fused embeddings, and $\text{output}$ is the final tensor of dimension $n \times D$ produced by $\text{HierFuse}(\cdot)$.
Through the adaptive embedding selection module, ANT-KT can flexibly select the most relevant subset of embeddings according to the requirements of the Knowledge Tracing task and integrate them into optimal input representations for the encoder and decoder via the hierarchical embedding fusion strategy.
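The following PyTorch sketch illustrates the embedding selection and pairwise fusion described above; it is a minimal reading of the formulas rather than the released code, and it assumes unordered pairs and illustrative tensor shapes.

```python
# Sketch of the adaptive embedding selection plus hierarchical fusion HierFuse(.).
# Unordered pairs (i < j) are assumed; whether ordered pairs are also fused is
# not specified here, so this is an illustrative simplification.
import itertools
import torch
import torch.nn as nn

class HierFuse(nn.Module):
    def __init__(self, num_selected: int, dim: int):
        super().__init__()
        num_pairs = num_selected * (num_selected - 1) // 2
        self.pair_proj = nn.Linear(2 * dim, dim)         # Linear([x_i, x_j])
        self.out_proj = nn.Linear(num_pairs * dim, dim)  # Linear(Concat(temp))

    def forward(self, selected):                         # list of (n, D) tensors
        temp = [torch.tanh(self.pair_proj(torch.cat([xi, xj], dim=-1)))
                for xi, xj in itertools.combinations(selected, 2)]
        return self.out_proj(torch.cat(temp, dim=-1))    # (n, D)

# A binary selector b picks a non-empty subset of the Num candidate embeddings.
Num, n, D = 5, 100, 128
candidates = [torch.randn(n, D) for _ in range(Num)]
b_en = torch.tensor([1, 0, 1, 1, 0])                     # must satisfy sum(b) >= 1
x_en = [e for e, keep in zip(candidates, b_en) if keep]
h_en0 = HierFuse(num_selected=len(x_en), dim=D)(x_en)    # encoder input H_En^0
```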

3.3.1. Enhanced Encoder

The enhanced encoder extends the Transformer encoder by introducing mechanisms to better capture both global and local dependencies in student interaction sequences. While self-attention effectively models long-range relations, it tends to overlook short-term patterns. To address this, we integrate convolutions and state vectors, enabling the encoder to jointly capture fine-grained learning fluctuations and overall knowledge evolution, thus providing richer representations for subsequent prediction. This module consists of $N_E$ stacked encoder units, which progressively model local and global dependencies in the student interaction sequence.
Corresponding to the input definition of the encoder in Formula (3), let $H_{En}^{i}$ denote the output of the $i$-th encoder unit, where $H_{En}^{0}$ represents the input feature embedding generated by the selective input module. The computation of an encoder unit proceeds as follows:
$Z_{En}^{i} = \text{Mamba}(H_{En}^{i-1}),$
where $Z_{En}^{i}$ denotes the intermediate global feature representation extracted by the Mamba module, which processes the output of the previous encoder unit. Through selective state space modeling, the Mamba module effectively captures global features to better model the student state. It employs a global state dimension $d_{state} = 16$ and a convolutional channel size $d_{conv} = 4$, enabling efficient modeling of long-range dependencies in the interaction sequence.
Next, local features of the sequence are captured by a one-dimensional convolution operation:
$C_{En}^{i} = H_{En}^{i-1} + \text{Conv1d}(H_{En}^{i-1}),$
where the Conv1d uses a kernel size of 4 to model short-term dependencies, and the residual connection preserves local features for subsequent computations.
After local feature extraction, the hybrid attention mechanism $M_{En}^{i}$ integrates multiple types of feature interactions. The hybrid mechanism dynamically combines different candidate operations (e.g., self-attention, gated convolution) as follows:
$M_{En}^{i}(x) = \sum_{k=1}^{K} \alpha_{k}^{i}\, o_{k}(x),$
where $o_{k}(x)$ represents the $k$-th candidate operation, and $\alpha_{k}^{i}$ is a learnable weight normalized with the softmax function, ensuring $\sum_{k=1}^{K} \alpha_{k}^{i} = 1$. This mechanism allows the encoder to adaptively adjust to varying sequence patterns, effectively integrating global and local dependencies.
The output of the hybrid attention mechanism is processed using residual connections and layer normalization to stabilize training:
$H_{En}^{i} = \text{LN}(C_{En}^{i} + M_{En}^{i}(C_{En}^{i})),$
where LN denotes layer normalization, which mitigates internal covariate shift and ensures smoother optimization.
Finally, the encoder unit refines its output through a feed-forward network (FFN), which expands the feature space with a hidden layer dimension of $d_{ffn} = 4 \times D$ before projecting back to the original dimension $D$:
$H_{En}^{i} = H_{En}^{i} + M_{En}^{i}(\text{FFN}(H_{En}^{i})).$
By stacking $N_E$ such encoder units, the enhanced encoder progressively extracts contextual features from the interaction sequence, resulting in the final representation:
$H_{En}^{N_E} = \text{Encoder}(X_{En}),$
where $H_{En}^{N_E}$ captures local and global dependencies in the sequence while effectively utilizing the input embeddings selected by the adaptive embedding selection module, enabling more accurate modeling of students’ cognitive states.
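To make the data flow of one encoder unit concrete, the simplified sketch below follows the steps above. It is not the authors’ implementation: the selective state-space (Mamba) block is replaced by a GRU as a stand-in for the global state path, the set of candidate operations is reduced to two, and because the fusion of the state path with the convolutional path is not fully specified by the formulas, the sketch simply sums the two branches.

```python
# Simplified sketch of one enhanced encoder unit (illustrative assumptions noted inline).
import torch
import torch.nn as nn

class EncoderUnit(nn.Module):
    def __init__(self, dim: int = 128, kernel_size: int = 4, num_ops: int = 2):
        super().__init__()
        self.global_state = nn.GRU(dim, dim, batch_first=True)  # stand-in for the Mamba block
        self.conv = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size - 1)  # causal-style Conv1d
        self.ops = nn.ModuleList([                               # candidate ops o_k of the hybrid module
            nn.MultiheadAttention(dim, num_heads=4, batch_first=True),
            nn.Sequential(nn.Linear(dim, dim), nn.GELU()),
        ])
        self.alpha = nn.Parameter(torch.zeros(num_ops))          # learnable hybrid weights alpha_k
        self.norm = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))

    def hybrid(self, x):                                         # M(x) = sum_k softmax(alpha)_k o_k(x)
        w = torch.softmax(self.alpha, dim=0)
        attn_out, _ = self.ops[0](x, x, x)
        return w[0] * attn_out + w[1] * self.ops[1](x)

    def forward(self, h):                                        # h: (B, n, D) = H_En^{i-1}
        z, _ = self.global_state(h)                              # global state path Z_En^i
        local = self.conv(h.transpose(1, 2))[..., : h.size(1)].transpose(1, 2)  # local path
        c = h + local + z                                        # path fusion simplified to a sum
        h = self.norm(c + self.hybrid(c))                        # hybrid attention + residual + LN
        return h + self.hybrid(self.ffn(h))                      # FFN refinement with residual
```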

3.3.2. Optimized Decoder

The decoder aims to predict the student’s response based on the encoder’s output $H_{En}^{N_E}$. It consists of $N_D$ stacked units, where each unit includes self-attention, linear attention, and a feed-forward network (FFN). The decoder maintains a consistent embedding dimension $D$ across all layers, with the initial input $H_{De}^{0}$ generated by the selective input module.
Each decoder unit first processes its input through a normalization layer to ensure numerical stability. Self-attention is then applied to capture local dependencies within the sequence. Through residual connections, the updated representation $Z_{De}^{j}$ retains the original feature information while enhancing contextual modeling. The self-attention step is formulated as follows:
$Z_{De}^{j} = H_{De}^{j} + \text{SelfAttn}(H_{De}^{j}),$
where $H_{De}^{j}$ is the input of the $j$-th decoder unit, and $Z_{De}^{j}$ represents the intermediate representation.
To efficiently model global dependencies while maintaining computational feasibility, this paper designs a linear attention mechanism. The mechanism begins by generating the query ($Q$), key ($K$), and value ($V$) matrices. Specifically, the decoder’s intermediate representation $Z_{De}^{j}$ and the encoder’s output $H_{En}^{N_E}$ are used for these computations. The query is obtained by applying a linear transformation to the concatenated encoder and decoder outputs:
$Q = W_a [H_{En}^{N_E}; Z_{De}^{j}],$
where $W_a \in \mathbb{R}^{2D \times D}$ is a learnable transformation matrix, and $[\,\cdot\,;\,\cdot\,]$ denotes concatenation along the feature dimension. The keys and values are derived solely from the encoder’s output as follows:
$K = W_b H_{En}^{N_E},$
$V = W_c H_{En}^{N_E}.$
Here, $W_b, W_c \in \mathbb{R}^{D \times D}$ are learnable transformation matrices, and $H_{En}^{N_E} \in \mathbb{R}^{L \times D}$ represents the encoder’s output sequence. This ensures that the keys and values fully capture the global dependencies encoded by the encoder.
To compute the attention distribution, the scaled dot-product mechanism is applied between the queries and keys:
$E = \text{Softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right),$
where $E \in \mathbb{R}^{L \times L}$ represents the attention weights. This matrix encodes the relevance of each element in the encoder’s output to the decoder’s intermediate representation, allowing for selective focus on the most relevant global features.
Using the attention weights, the global context representation is obtained by weighting the value matrix $V$:
$F = E V,$
where $F \in \mathbb{R}^{L \times D}$ aggregates the most relevant information from the encoder outputs. This context representation encapsulates global dependencies and is critical for refining the decoder’s features.
To integrate the global context $F$ into the decoder’s representation, a nonlinear transformation is applied:
$\tilde{Z}_{De}^{j} = Z_{De}^{j} + \text{ReLU}(W_f F + Z_{De}^{j}),$
where $W_f \in \mathbb{R}^{D \times D}$ is a learnable transformation matrix. The addition ensures that the local features in $Z_{De}^{j}$ are preserved, while the global context $F$ enriches the representation with long-range dependencies. The ReLU activation introduces nonlinearity, improving the model’s capacity to capture complex interactions.
This mechanism balances computational efficiency with representational power, reducing the quadratic complexity of traditional self-attention while maintaining the ability to model global dependencies.
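The following PyTorch sketch traces these steps (query from the concatenated encoder/decoder states, keys and values from the encoder output, scaled dot-product weights, ReLU fusion). Matrix names follow the text; shapes, bias handling, and the scaling constant are assumptions.

```python
# Sketch of the decoder's attention over encoder outputs as described above.
import torch
import torch.nn as nn

class EncoderDecoderAttention(nn.Module):
    def __init__(self, dim: int = 128):
        super().__init__()
        self.w_a = nn.Linear(2 * dim, dim, bias=False)   # Q = W_a [H_En ; Z_De]
        self.w_b = nn.Linear(dim, dim, bias=False)       # K = W_b H_En
        self.w_c = nn.Linear(dim, dim, bias=False)       # V = W_c H_En
        self.w_f = nn.Linear(dim, dim, bias=False)       # fusion projection W_f
        self.scale = dim ** -0.5

    def forward(self, h_en, z_de):                       # both (B, L, D)
        q = self.w_a(torch.cat([h_en, z_de], dim=-1))
        k, v = self.w_b(h_en), self.w_c(h_en)
        e = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)  # (B, L, L)
        f = e @ v                                         # global context F
        return z_de + torch.relu(self.w_f(f) + z_de)      # fused representation \tilde{Z}_De

# Usage: fuse encoder output with the decoder's intermediate representation.
attn = EncoderDecoderAttention(dim=128)
out = attn(torch.randn(2, 100, 128), torch.randn(2, 100, 128))
```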
To further refine the features, local convolutional operations and hybrid modules are applied. The convolutional layer captures short-term dependencies, while the hybrid operation $M_{De}^{j}$ dynamically selects kernel sizes and specific operations. The refined features are updated as follows:
$H_{De}^{j} = \tilde{Z}_{De}^{j} + M_{De}^{j}(\text{Conv1d}(\tilde{Z}_{De}^{j})).$
Finally, the feed-forward network expands the feature space with a hidden layer dimension of $d_{ffn} = 4 \times D$. The output of the FFN is combined with the features from the convolutional operation through residual connections:
$H_{De}^{j} = H_{De}^{j} + M_{De}^{j}(\text{FFN}(H_{De}^{j})).$
The final decoder output $H_{De}^{N_D}$ is passed through a linear transformation and a Sigmoid function to predict the probability that the student’s response is correct:
$R = \sigma(w_r H_{De}^{N_D} + b_r),$
where $w_r \in \mathbb{R}^{D}$ and $b_r$ are the weights and bias of the fully connected layer. The model is trained by minimizing the binary cross-entropy loss:
$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\left[ y_i \log R_i + (1 - y_i)\log(1 - R_i) \right],$
where $y_i \in \{0, 1\}$ is the ground-truth label, and $R_i$ is the predicted probability for the $i$-th sample.
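A minimal sketch of this prediction head and loss is shown below; treating the final decoder state as a single vector per sample is a simplification for illustration.

```python
# Linear layer + sigmoid gives R; training minimizes binary cross-entropy.
import torch
import torch.nn as nn

dim = 128
head = nn.Linear(dim, 1)                        # w_r, b_r
h_de = torch.randn(32, dim)                     # final decoder state per sample (assumed pooled)
y = torch.randint(0, 2, (32,)).float()          # ground-truth correctness labels

r = torch.sigmoid(head(h_de)).squeeze(-1)       # predicted probability of a correct response
loss = nn.functional.binary_cross_entropy(r, y)
loss.backward()
```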

3.4. Neural Architecture Search

To find the optimal network configuration, we define the above model as a flexible search space, as shown in Figure 2, and apply NAS for optimization. The search space contains the following three groups of key parameters:
Input feature selection vectors: $b_{En}, b_{De} \in \{0,1\}^{1 \times Num}$, where $Num$ is the total number of candidate features, controlling which input features are used by the encoder and decoder.
Hybrid operation parameters: the hybrid operation parameters of the encoder and decoder are $\alpha_{En} \in \mathbb{R}^{N_E \times K}$ and $\alpha_{De} \in \mathbb{R}^{N_D \times K}$, where $K$ is the number of candidate operations. The different operations (such as convolution, feed-forward network, etc.) in each hybrid operation are weighted and combined through the weights $\alpha$, defining the final structure of the model.
Number of layers: the number of encoder layers $N_E$ and the number of decoder layers $N_D$.
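For illustration, one point in this search space can be encoded as a simple dictionary, as in the sketch below; the field names, bounds, and random sampling are assumptions, not the paper’s exact encoding.

```python
# Illustrative encoding of one candidate architecture in the search space.
import numpy as np

NUM_FEATURES, K, MAX_LAYERS = 12, 4, 4   # assumed sizes for illustration

def random_architecture(rng: np.random.Generator) -> dict:
    def non_empty_mask():
        m = rng.integers(0, 2, NUM_FEATURES)
        if m.sum() == 0:                  # constraint: at least one input feature selected
            m[rng.integers(NUM_FEATURES)] = 1
        return m
    n_e, n_d = rng.integers(1, MAX_LAYERS + 1, 2)
    return {
        "b_en": non_empty_mask(),         # encoder input selector b_En
        "b_de": non_empty_mask(),         # decoder input selector b_De
        "alpha_en": rng.random((n_e, K)), # hybrid-operation weights per encoder layer
        "alpha_de": rng.random((n_d, K)), # hybrid-operation weights per decoder layer
        "n_e": int(n_e), "n_d": int(n_d), # layer counts N_E, N_D
    }

arch = random_architecture(np.random.default_rng(0))
```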

3.4.1. Objective Function

Let $\theta$ be all the weight parameters of the Transformer variant model and $\mathcal{A}$ represent the search space; then the goal of joint learning can be expressed as follows:
$\min_{\alpha \in \mathcal{A}} \mathcal{L}_{val}(w^{*}(\alpha), \alpha) \quad \text{s.t.} \quad w^{*}(\alpha) = \arg\min_{w} \mathcal{L}_{train}(w, \alpha),$
where $\mathcal{L}_{train}$ and $\mathcal{L}_{val}$ denote the loss functions of the model on the training set and validation set, respectively, $\alpha$ represents the architecture parameters in the search space, and $w^{*}(\alpha)$ represents the optimal weights obtained by training with the architecture $\alpha$ fixed.

3.4.2. Search Strategies

The search process employs two key strategies:
  • Weighted Combination of Operations: each operation $o_i$ is assigned an importance weight $\alpha_i$:
    $o(x) = \sum_{i=1}^{K} \frac{\exp(\alpha_i)}{\sum_{j=1}^{K} \exp(\alpha_j)}\, o_i(x).$
    This allows flexible combination of different structural elements, improving generalization.
  • Weight Sharing: Certain parameters are shared across different architectural configurations, significantly reducing computational overhead.
These strategies work in tandem to explore a wide range of model structures while maintaining computational feasibility.
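The sketch below illustrates both strategies in PyTorch: a softmax-weighted mixture over candidate operations makes the architecture weights differentiable, and every candidate architecture reuses the same operation parameters (weight sharing). The specific candidate operations and kernel sizes are illustrative assumptions.

```python
# Sketch of a weighted mixture of candidate operations with shared parameters.
import torch
import torch.nn as nn

class MixedOp(nn.Module):
    def __init__(self, dim: int, kernel_sizes=(3, 5)):
        super().__init__()
        self.ops = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.ReLU())]                      # FFN candidate
            + [nn.Conv1d(dim, dim, k, padding="same") for k in kernel_sizes]     # conv candidates
        )
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))  # importance weights alpha_i

    def forward(self, x):                                       # x: (B, n, D)
        w = torch.softmax(self.alpha, dim=0)
        outs = []
        for op in self.ops:
            if isinstance(op, nn.Conv1d):                       # conv expects (B, D, n)
                outs.append(op(x.transpose(1, 2)).transpose(1, 2))
            else:
                outs.append(op(x))
        # o(x) = sum_i softmax(alpha)_i * o_i(x); the op parameters are shared
        # across all architectures sampled during the search.
        return sum(wi * oi for wi, oi in zip(w, outs))
```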

3.4.3. Improved Evolutionary Algorithm

To further improve the efficiency of the architecture search, this paper improves on the traditional evolutionary algorithm, as shown in Figure 3. Existing evolutionary algorithms typically focus solely on optimizing model performance while neglecting the balance between model performance and computational efficiency. To address this issue, we introduce a model optimization efficiency objective. Specifically, we design an objective function that jointly considers model performance and computational efficiency:
$f(a) = \frac{ACC(a)}{T(a)^{\beta}}, \quad a \in \mathcal{A},$
where $ACC(a)$ denotes the accuracy of the model with architecture $a$ on the validation set, $T(a)$ represents the inference time of the model, and $\beta$ is a balancing factor. By maximizing this objective function, the evolutionary algorithm can discover architectures that are both high-performing and computationally efficient.
Additionally, to further accelerate the search process, we employ a dynamic reduction strategy to adjust the size of the search space. Specifically, during the evolutionary process, the number of candidate operations is dynamically adjusted based on the fitness distribution of individuals in the population. As the evolution progresses, candidates with lower fitness are gradually eliminated, leading to a progressively smaller search space, thereby significantly reducing the time cost of the search.
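The sketch below illustrates the efficiency-aware objective and one plausible reduction step; the keep-ratio, the per-operation averaging, and the example scores are assumptions used only to show the mechanism.

```python
# Sketch of the efficiency-aware fitness f(a) = ACC(a) / T(a)^beta and a
# dynamic reduction step that drops the weakest candidate operations.
from typing import Dict, List

def fitness(acc: float, inference_time: float, beta: float = 0.5) -> float:
    return acc / (inference_time ** beta)

def shrink_search_space(op_scores: Dict[str, List[float]], keep_ratio: float = 0.75) -> List[str]:
    """Keep the operations whose average fitness (over individuals using them)
    ranks in the top `keep_ratio`; the rest are removed from the candidate set."""
    avg = {op: sum(v) / len(v) for op, v in op_scores.items() if v}
    ranked = sorted(avg, key=avg.get, reverse=True)
    keep = max(1, int(len(ranked) * keep_ratio))
    return ranked[:keep]

# Example: per-operation fitness values gathered from the current population.
op_scores = {"mhsa": [1.02, 0.97], "conv": [1.10, 1.05], "ffn": [1.08], "gated_conv": [0.88]}
surviving_ops = shrink_search_space(op_scores)   # lowest-fitness candidate is dropped
```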

3.4.4. Alternating Optimization

To efficiently optimize the architecture parameters $\alpha$ and weight parameters $w$, we adopt the gradient-based optimization method DARTS. The core idea of DARTS is to relax the discrete architecture selection into a continuous mixture, so that gradient information can be used to optimize the architecture parameters and weight parameters simultaneously. In each training step, a batch $D_{train}$ is first sampled from the training set, and the gradient is computed with the current architecture parameters $\alpha_t$ to update the weight parameters:
$w_{t+1} = w_t - \eta_w \nabla_w \mathcal{L}_{train}(w_t, \alpha_t, D_{train}).$
Then, a batch $D_{val}$ is sampled from the validation set, and the gradient is computed with the updated weight parameters $w_{t+1}$ to update the architecture parameters:
$\alpha_{t+1} = \alpha_t - \eta_\alpha \nabla_\alpha \mathcal{L}_{val}(w_{t+1}, \alpha_t, D_{val}),$
where $\eta_w$ and $\eta_\alpha$ denote the learning rates of the weights and the architecture, respectively. Through this alternating optimization, the model continuously updates its architecture to adapt to the current task during training. After the search ends, for each position that requires a decision, we select the operation with the largest $\alpha$ value as part of the final architecture:
$o^{*} = \underset{o^{(i)}}{\arg\max}\ \alpha_i.$
That is, the operation with the largest weight $\alpha_i$ is selected at each position to form the optimal architecture $\alpha^{*}$. After fixing the optimal architecture $\alpha^{*}$, the model weights are retrained on the complete dataset to obtain the final optimized model. Through the above steps, the design space of the Transformer architecture can be explored efficiently and comprehensively, and the optimal architecture for the Knowledge Tracing task can be discovered automatically.
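A minimal sketch of one alternating (first-order, DARTS-style) update step is given below; the `model`, its separate optimizers for weights and architecture parameters, and the batch format are hypothetical scaffolding around the two gradient updates above.

```python
# Sketch of alternating optimization: weights w on a training batch, then
# architecture parameters alpha on a validation batch (first-order approximation).
import torch

def alternating_step(model, train_batch, val_batch, opt_w, opt_alpha, loss_fn):
    # 1) update model weights w on D_train with alpha held fixed
    x, y = train_batch
    opt_w.zero_grad()
    loss_fn(model(x), y).backward()
    opt_w.step()

    # 2) update architecture parameters alpha on D_val with the new weights w_{t+1}
    x_val, y_val = val_batch
    opt_alpha.zero_grad()
    loss_fn(model(x_val), y_val).backward()
    opt_alpha.step()

# After the search, each mixed operation is discretized by keeping the candidate
# with the largest alpha value.
def discretize(alpha: torch.Tensor) -> int:
    return int(alpha.argmax().item())
```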

4. Experiments

In this section, we present the experimental setup, result analysis, and ablation studies used to comprehensively evaluate the effectiveness of our proposed model, ANT-KT, on the Knowledge Tracing task. We validated the performance of the model on two large-scale Knowledge Tracing datasets and compared it with multiple strong baselines.

4.1. Datasets

During the experiments, we opted for the two most mainstream real-world educational datasets, EdNet [29] and RAIEd2020 [30], to rigorously evaluate the model’s performance. The key characteristics of these datasets are summarized in Table 1.
The datasets include feature embeddings for exercises, skills, tags, tag-sets, and bundles/explanations, as shown in Table 1.
In addition to these five embeddings, the candidate input embedding set $X_{embed}$ incorporates seven further feature embeddings, for a total of 12: an answer embedding, a continuous elapsed time embedding, a categorical elapsed time embedding in seconds, a continuous lag time embedding, and categorical lag time embeddings in seconds, minutes, and days.
By leveraging these large-scale, real-world datasets with rich feature embeddings, the experiments provide a convincing validation of the proposed methodology in authentic educational contexts.

4.2. Baseline Models

We compared our model with the following Knowledge Tracing models:
  • DKT [8]: The DKT model is the first KT method leveraging deep neural networks, commonly used as a baseline model. It focuses only on concept labels, ignoring exercise-level information.
  • HawkesKT [31]: This model uses Hawkes processes to model the temporal dynamics in student interactions, enabling more accurate predictions of knowledge states by capturing the time-sensitive effects of prior responses.
  • CT-NCM [32]: The CT-NCM method incorporates a continuous-time neural cognitive model that captures knowledge forgetting and learning progress through time-aware neural architectures.
  • SAKT [10]: A self-attentive Knowledge Tracing model that utilizes attention mechanisms to model relationships among exercises, capturing dependencies without relying on sequence order.
  • AKT [24]: The AKT method employs context-aware representations and Rasch embeddings to dynamically link students’ historical responses with their future interactions.
  • SAINT [11]: This model adopts a Transformer-based encoder–decoder structure to integrate both student and exercise interactions for more comprehensive Knowledge Tracing.
  • SAINT+ [33]: Building on SAINT, SAINT+ incorporates additional temporal features such as elapsed time and lag time, enhancing its ability to model temporal dependencies in student learning.
  • NAS-Cell [28]: This method applies reinforcement learning to optimize RNN cell structures for Knowledge Tracing tasks, achieving architectures superior to manually designed ones.
  • DisKT [34]: The DisKT model addresses cognitive bias in Knowledge Tracing by disentangling students’ familiar and unfamiliar abilities through causal intervention, using a contradiction attention mechanism to suppress guessing/mistaking effects and integrating an Item Response Theory variant for interpretability.
  • ENAS-KT [12]: An evolutionary neural architecture search method tailored for KT, which designs a search space specialized for capturing Knowledge Tracing dynamics effectively.
  • AAKT [35]: The AAKT model reframes Knowledge Tracing as a generative autoregressive process, which alternatively encodes question–response sequences to directly model pre- and post-response knowledge states, and incorporates auxiliary skill prediction and extra exercise features by enhancing sequences through state-of-the-art Natural Language Generation (NLG) techniques.

4.3. Implementation Details

During the experiments, all students were randomly partitioned into three subsets with proportions of 70%, 10%, and 20%, serving as the training, validation, and test sets, respectively. The maximum length of the input sequences was set to 100; student interaction sequences exceeding this length were truncated into multiple subsequences. A 5-fold cross-validation strategy was adopted to ensure the robustness of the experimental results. For the evolutionary search algorithm, we set the population size to 50 and the maximum number of generations to 100. These parameters follow established practices in evolutionary neural architecture search [36,37]. The balance factor β = 0.5 in Equation (25) is determined based on the Pareto optimization principle [38], which suggests equal weighting when objectives are normalized to comparable scales. This configuration ensures that neither prediction accuracy nor computational efficiency dominates the search process. In terms of model configuration, the number of blocks (N) was set to 4, the embedding dimension (D) was 128, and the hidden dimension of the feed-forward network (FFN) was also configured as 128. These settings balance representational capacity and computational efficiency, following the principle of compact architectures widely adopted in resource-constrained sequential prediction tasks [37,39,40]. For model training, the hyperparameters were specified as follows: the number of training epochs was 60, the initial learning rate was 1 × 10−3, the dropout rate was 0.1, and the batch size was 128. All experiments were conducted on a single server with an RTX 3090 GPU.

4.4. Main Results

Table 2 presents the performance evaluation and comparison between the ANT-KT model and current mainstream Knowledge Tracing models. Specifically, all results for the comparative models are taken from their original papers, and the evaluation is conducted using three metrics: RMSE (Root Mean Square Error), ACC (Accuracy), and AUC (Area Under the ROC Curve). A smaller RMSE indicates better performance, while larger ACC and AUC values correspond to better model performance.
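For reference, these metrics can be computed from predicted correctness probabilities as in the scikit-learn snippet below; the values and the 0.5 threshold used for ACC are illustrative assumptions, not numbers from the paper.

```python
# Computing RMSE, ACC, and AUC from predicted correctness probabilities.
import numpy as np
from sklearn.metrics import accuracy_score, mean_squared_error, roc_auc_score

y_true = np.array([1, 0, 1, 1, 0, 1])                  # ground-truth responses
y_prob = np.array([0.83, 0.24, 0.61, 0.47, 0.35, 0.91])  # model probabilities

rmse = np.sqrt(mean_squared_error(y_true, y_prob))     # lower is better
acc = accuracy_score(y_true, (y_prob >= 0.5).astype(int))
auc = roc_auc_score(y_true, y_prob)                    # higher is better
```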
As shown in Table 2, the ANT-KT model performs strongly on both the EdNet and RAIEd2020 datasets. Notably, it outperforms the baseline model ENAS-KT across all three evaluation metrics. In terms of RMSE, ANT-KT reduces the error to 0.4062, a further improvement over AAKT and a 1.47% reduction relative to ENAS-KT. Beyond RMSE, ANT-KT also delivers strong results on the AUC metric, boosting the score by 3.25% compared with ENAS-KT, which is recognized as the state-of-the-art method in this field. Consistent with its performance on EdNet, ANT-KT maintains clear advantages over the baseline ENAS-KT across all evaluation metrics on the RAIEd2020 dataset as well.

4.5. Ablation Studies

As shown in Table 3, we performed ablation experiments by individually adjusting the key modules of the enhanced encoder and decoder to analyze the impact of different components of ANT-KT on the model performance.
The results show that the optimized decoder contributes most significantly to the improvement of model performance. For example, on the EdNet dataset, compared with the baseline model (which adopts standard MHSA in the decoder), the introduction of our optimized decoder (with linear attention) improves AUC from 0.8062 to 0.8208. This direct comparison confirms that linear attention can more effectively capture long-range dependencies within student interaction sequences. This allows the model to better utilize historical information during inference, leading to more accurate predictions. This finding aligns with the current research trend in the Knowledge Tracing field, which emphasizes the impact of students’ long-term learning processes on knowledge states.
In contrast, the contribution of the enhanced encoder is relatively limited and, in some cases, even results in slight performance degradation. The observed performance degradation when employing the enhanced encoder in isolation (AUC decreasing from 0.8062 to 0.7993 on EdNet) can be explained by two factors. First, the Conv1d operations, while effective at capturing short-term patterns, may disproportionately emphasize local features and thereby disrupt the modeling of long-range semantic dependencies that are crucial in Knowledge Tracing tasks. Second, in the absence of the optimized decoder’s linear attention mechanism, which serves to re-integrate global contextual information, the encoder’s locally extracted representations remain insufficiently refined.
However, when the enhanced encoder and the optimized decoder are used together, the model achieves further performance improvements. For example, on the EdNet dataset, ANT-KT achieves an AUC of 0.8387, which is higher than the 0.8208 obtained using only the optimized decoder. This suggests that, although the enhanced encoder alone may introduce interference, it can generate a synergistic effect when combined with the optimized decoder. The optimized decoder may mitigate the local-global imbalance introduced by the enhanced encoder by strengthening global information fusion, allowing the model to consider both local features and global semantics.
Overall, the ablation experiments reveal the mechanisms and interactions of different components in ANT-KT. The optimized decoder plays a critical role in improving performance by capturing long-range dependencies, while the enhanced encoder complements the model by extracting richer local information. Although the enhanced encoder may introduce interference when used alone, it produces a complementary effect when working synergistically with the optimized decoder, jointly driving the improvement of model performance.

4.6. Analysis of Search Optimization Strategies and Training Efficiency Comparison

To verify the advantages of ANT-KT in training efficiency, we selected the NAS-based method ENAS-KT, the classic Transformer-based model SAINT+, and several variants of ANT-KT for comparative experiments. We evaluated the time consumption of three key stages: supernet training, evolutionary search, and final architecture training. Table 4 shows the training time of the different models in these three stages. As shown in Table 4, in the supernet training stage ANT-KT requires 39.28 h, which is 6.2% less than ENAS-KT’s 41.86 h.
During the evolutionary search phase, the dynamic search space reduction mechanism in ANT-KT is guided by a three-dimensional fitness vector that includes AUC, ACC, and latency (Lines 3–7 of Algorithm A1 in Appendix A). Specifically, this three-dimensional fitness vector can effectively guide the search by analyzing the elite individuals on the Pareto frontier (Lines 15–19 of Algorithm A1 in Appendix A) and adaptively adjusting the search boundaries (Lines 20–21). As shown in Table 4, this mechanism effectively reduces the time of the model framework in the evolutionary search phase, indicating that it can balance model performance and computational efficiency simultaneously.
In the evolutionary search phase, the unique advantages of ANT-KT’s search strategies are evident when comparing the optimal architecture discovered by ANT-KT as shown in Figure 4a, to that discovered by the original evolutionary algorithm as shown in Figure 4b. The original evolutionary algorithm tends to identify complex architectures dominated by MHSA modules, leading to high computational overhead. In contrast, ANT-KT’s optimal architecture incorporates multiple FFN and convolution modules in the encoder and balances the use of MHSA and FFN in the decoder. This design, which balances local feature extraction and global information fusion, improves model performance while effectively controlling computational complexity. As shown in Table 4, ANT-KT reduces search time from ENAS-KT’s 12.8 h to 7.42 h, a reduction of 42.0%. Additionally, compared to ANT-KT without the efficiency-aware search strategy, the search time is further reduced by 15.7%, demonstrating the effectiveness of the algorithm. Despite the introduction of efficiency considerations in the search optimization strategy, performance is not sacrificed. ANT-KT achieves an AUC of 0.8228 and an ACC of 0.7439, outperforming most baseline models. This highlights ANT-KT’s excellent balance between efficiency and performance, validating the effectiveness and robustness of its search optimization strategy.

5. Conclusions

This paper presents ANT-KT, a Transformer-based adaptive neural architecture search (NAS) framework designed to address the limitations of existing methods in Knowledge Tracing tasks, particularly in search efficiency and modeling accuracy. By integrating an enhanced encoder, an optimized decoder, and a refined evolutionary algorithm, ANT-KT achieves a significant balance between computational efficiency and predictive performance. Experimental results on two large-scale datasets demonstrate that ANT-KT consistently outperforms state-of-the-art baselines across multiple metrics, highlighting its ability to capture both short-term and long-term dependencies in student interaction data. Ablation studies further confirm the effectiveness of each proposed component, showcasing their contributions to the framework’s overall performance.
The findings of this study underline the potential of NAS-driven approaches in adaptive Knowledge Tracing and personalized learning systems. Future research could extend ANT-KT to a broader range of educational tasks and datasets, as well as explore further optimizations in the search process to enhance scalability and generalizability. Furthermore, given the still relatively lengthy training times, future work will focus on enhancing efficiency through progressive NAS and distillation-based search. This work offers valuable insights for advancing intelligent education systems and adaptive learning technologies.

Author Contributions

Conceptualization, S.Y. and Y.S.; methodology, S.Y.; software, Y.S.; validation, Y.S., Y.L. and J.C.; formal analysis, J.C.; writing—original draft preparation, S.Y.; writing—review and editing, D.Z. and X.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Shandong Provincial Research Project on Artificial Intelligence Education (SDDJ202501034), the National Natural Science Foundation of China under Grant (62341603, 62006107), and the Introduction and Cultivation Program for Young Innovative Talents of Universities in Shandong Province (2021QCYY003).

Informed Consent Statement

Not applicable.

Data Availability Statement

EdNet: https://github.com/riiid/ednet (accessed on 8 October 2025); RAIEd2020: https://www.kaggle.com/competitions/riiid-test-answer-prediction (accessed on 8 October 2025).

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Appendix A

Algorithm A1 Dynamic Multi-Objective Evolutionary Algorithm with Adaptive Search Space
1: Input: Population size N, max generations G_max, initial bounds Ω_0
2: Output: Final Pareto-optimal set P*
3: Initialize population P_0 = {x_1, …, x_N} within Ω_0
4: for all x_j ∈ P_0 do
5:     Evaluate: AUC_j, ACC_j, latency_j
6:     f(x_j) = [AUC_j, ACC_j, latency_j]^T
7: end for
8: Perform non-dominated sorting F_0 and compute crowding distances CD_0
9: t ← 0
10: while t < G_max do
11:     if t > 0 then
12:         E ← {x ∈ P_t : ParetoRank(x) = 1}
13:         Compute μ and σ for all objectives
14:         Ω_{t+1} ← AdaptBounds(E, Ω_t, t, μ, σ)
15:     else
16:         Ω_{t+1} ← Ω_0
17:     end if
18:     // — Selection —
19:     Q ← ∅
20:     for i = 1 to N/2 do
21:         for j = 1 to 2 do
22:             Randomly pick k_1, k_2 ∈ P_t
23:             if F[k_1] < F[k_2] then
24:                 p_j ← x_{k_1}
25:             else if F[k_1] > F[k_2] then
26:                 p_j ← x_{k_2}
27:             else if CD[k_1] > CD[k_2] then
28:                 p_j ← x_{k_1}
29:             else
30:                 p_j ← x_{k_2}
31:             end if
32:         end for
33:         Q ← Q ∪ {p_1, p_2}
34:     end for
35:     // — Reproduction —
36:     O ← ∅
37:     for all (p_1, p_2) ∈ Q do
38:         Generate (o_1, o_2) via crossover
39:         Mutate offspring within Ω_{t+1}
40:         O ← O ∪ {o_1, o_2}
41:     end for
42:     // — Evaluation —
43:     for all o ∈ O do
44:         Evaluate: AUC_o, ACC_o, latency_o
45:         f(o) = [AUC_o, ACC_o, latency_o]^T
46:     end for
47:     // — Environmental Selection —
48:     R ← P_t ∪ O
49:     Perform non-dominated sorting of R into fronts F_1, F_2, …
50:     P_{t+1} ← ∅, i ← 1
51:     while |P_{t+1}| + |F_i| ≤ N do
52:         P_{t+1} ← P_{t+1} ∪ F_i
53:         i ← i + 1
54:     end while
55:     if |P_{t+1}| < N then
56:         Compute crowding distance for F_i
57:         Sort F_i in descending order of distance
58:         Add top (N − |P_{t+1}|) individuals to P_{t+1}
59:     end if
60:     Update F_{t+1} and CD_{t+1}; t ← t + 1
61: end while
62: return P* = {x ∈ P_{G_max} : ParetoRank(x) = 1}
Algorithm A1 is designed based on the concept of “Pareto optimality” and mainly consists of 4 stages, namely the initialization stage, the main loop preparation stage, the main loop stage, and the output stage. In the algorithm, lines 1–7 belong to the initialization stage. In this stage, the definition of input and output is completed in lines 1–2; in line 3, the population P 0 is randomly initialized within the initial boundary Ω 0 ; and the evaluation of the initial population is completed in lines 4–7. Lines 8–9 are the main loop preparation stage. Lines 10–59 are the main loop stage. In this stage, the adaptive update of the search space is completed in lines 11–18; lines 19–35 are the selection process; lines 36–42 are the reproduction process; lines 43–46 are the evaluation process; lines 47–58 are the environmental selection process; and the update is completed in line 59. Finally, the Pareto optimal solution set obtained is output in line 61.
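As an illustration of the output stage, the sketch below extracts the Pareto-rank-1 individuals under the three objectives used in Algorithm A1 (AUC and ACC maximized, latency minimized); the dominance definition is the standard one and the example values are hypothetical.

```python
# Sketch of Pareto-rank-1 extraction over (AUC, ACC, latency) fitness vectors.
from typing import List, Tuple

def dominates(a: Tuple[float, float, float], b: Tuple[float, float, float]) -> bool:
    """a = (auc, acc, latency). a dominates b if it is no worse in every
    objective and strictly better in at least one."""
    no_worse = a[0] >= b[0] and a[1] >= b[1] and a[2] <= b[2]
    strictly = a[0] > b[0] or a[1] > b[1] or a[2] < b[2]
    return no_worse and strictly

def pareto_front(population: List[Tuple[float, float, float]]) -> List[int]:
    return [i for i, p in enumerate(population)
            if not any(dominates(q, p) for j, q in enumerate(population) if j != i)]

pop = [(0.820, 0.742, 3.1), (0.823, 0.744, 4.0), (0.815, 0.740, 2.5), (0.810, 0.735, 4.2)]
front = pareto_front(pop)   # indices of the non-dominated (rank-1) individuals
```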

Appendix B. Limitation Analysis

Appendix B.1. Dataset Generalization and Low-Resource Scenarios

The current evaluation primarily relies on two large-scale datasets (EdNet: 95 M interactions, RAIEd2020: 99 M interactions), which offer comprehensive and rigorous benchmarks. However, the generalizability of ANT-KT under limited-data conditions remains to be fully examined.
In particular, the model’s effectiveness in small-scale educational settings—such as ASSISTments (approximately 4K students)—is yet to be validated. The evolutionary search procedure, which requires about 7.4 h of computation, may impose a non-trivial cost for institutions with restricted data or computational resources. Moreover, smaller datasets heighten the risk of architecture overfitting, potentially undermining the stability of the searched model. In future work, we plan to explore transfer learning strategies to fine-tune architectures discovered on large-scale datasets for smaller domains, thereby improving efficiency and generalization.

Appendix B.2. Deployment Feasibility

Although ANT-KT achieves a favorable trade-off between accuracy and efficiency in offline evaluations, its real-world deployment feasibility remains untested. The current analysis focuses on model inference time, whereas practical tutoring systems involve additional latency sources such as data preprocessing, network transmission, and interface rendering. Consequently, the overall end-to-end delay on mobile or web-based platforms may exceed interactive thresholds (<100 ms [41]). Moreover, performance on resource-constrained devices (e.g., tablets or older school computers) has not been validated, indicating the need for compression or quantization in future implementation. Overall, this study is limited to algorithmic evaluation; validating ANT-KT under real deployment conditions will be an important direction for future research.

References

  1. Abdelrahman, G.; Wang, Q.; Nunes, B. Knowledge Tracing: A Survey. ACM Comput. Surv. 2023, 55, 1–37.
  2. Dowling, C.; Hockemeyer, C. Automata for the assessment of knowledge. IEEE Trans. Knowl. Data Eng. 2001, 13, 451–461.
  3. Geigle, C.; Zhai, C. Modeling MOOC Student Behavior with Two-Layer Hidden Markov Models. In Proceedings of the Fourth (2017) ACM Conference on Learning @ Scale, Cambridge, MA, USA, 20–21 April 2017; pp. 205–208.
  4. Anderson, A.; Huttenlocher, D.; Kleinberg, J.; Leskovec, J. Engaging with massive online courses. In Proceedings of the 23rd International Conference on World Wide Web, Seoul, Republic of Korea, 7–11 April 2014; pp. 687–698.
  5. Corbett, A.T.; Anderson, J.R. Knowledge tracing: Modeling the acquisition of procedural knowledge. User Model. User-Adapt. Interact. 2005, 4, 253–278.
  6. Yudelson, M.V.; Koedinger, K.R.; Gordon, G.J. Individualized Bayesian Knowledge Tracing Models. In Proceedings of the Artificial Intelligence in Education, Memphis, TN, USA, 9–13 July 2013; pp. 171–180.
  7. Baker, R.S.J.d.; Corbett, A.T.; Gowda, S.M.; Wagner, A.Z.; MacLaren, B.A.; Kauffman, L.R.; Mitchell, A.P.; Giguere, S. Contextual Slip and Prediction of Student Performance after Use of an Intelligent Tutor. In Proceedings of the User Modeling, Adaptation, and Personalization, Big Island, HI, USA, 20–24 June 2010; pp. 52–63.
  8. Piech, C.; Bassen, J.; Huang, J.; Ganguli, S.; Sahami, M.; Guibas, L.J.; Sohl-Dickstein, J. Deep Knowledge Tracing. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; Volume 28.
  9. Zhang, J.; Shi, X.; King, I.; Yeung, D.Y. Dynamic Key-Value Memory Networks for Knowledge Tracing. In Proceedings of the 26th International Conference on World Wide Web, Perth, Australia, 3–7 April 2017; pp. 765–774.
  10. Pandey, S.; Karypis, G. A Self-Attentive model for Knowledge Tracing. arXiv 2019, arXiv:1907.06837.
  11. Choi, Y.; Lee, Y.; Cho, J.; Baek, J.; Kim, B.; Cha, Y.; Shin, D.; Bae, C.; Heo, J. Towards an Appropriate Query, Key, and Value Computation for Knowledge Tracing. In Proceedings of the Seventh ACM Conference on Learning @ Scale, Virtual, 12–14 August 2020; pp. 341–344.
  12. Yang, S.; Yu, X.; Tian, Y.; Yan, X.; Ma, H.; Zhang, X. Evolutionary Neural Architecture Search for Transformer in Knowledge Tracing. In Proceedings of the 37th International Conference on Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023.
  13. Elsken, T.; Metzen, J.H.; Hutter, F. Neural architecture search: A survey. J. Mach. Learn. Res. 2019, 20, 1997–2017.
  14. Lecun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324.
  15. Dao, T.; Gu, A. Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality. In Proceedings of the International Conference on Machine Learning (ICML), Vienna, Austria, 21–27 July 2024.
  16. Real, E.; Aggarwal, A.; Huang, Y.; Le, Q.V. Regularized evolution for image classifier architecture search. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence and Thirty-First Innovative Applications of Artificial Intelligence Conference and Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; AAAI Press: Washington, DC, USA, 2019; AAAI’19/IAAI’19/EAAI’19.
  17. Xiong, X.; Zhao, S.; Inwegen, E.G.V.; Beck, J.E. Going deeper with deep knowledge tracing. In Proceedings of the International Educational Data Mining Society, Raleigh, NC, USA, 29 June–2 July 2016.
  18. Shen, S.; Huang, Z.; Liu, Q.; Su, Y.; Wang, S.; Chen, E. Assessing Student’s Dynamic Knowledge State by Exploring the Question Difficulty Effect. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, Madrid, Spain, 11–15 July 2022; SIGIR ’22, pp. 427–437.
  19. Liu, Z.; Liu, Q.; Chen, J.; Huang, S.; Gao, B.; Luo, W.; Weng, J. Enhancing Deep Knowledge Tracing with Auxiliary Tasks. In Proceedings of the ACM Web Conference 2023, Austin, TX, USA, 30 April–4 May 2023; WWW ’23, pp. 4178–4187.
  20. Yeung, C.K.; Yeung, D.Y. Addressing two problems in deep knowledge tracing via prediction-consistent regularization. In Proceedings of the Fifth Annual ACM Conference on Learning at Scale, London, UK, 26–28 June 2018; L@S ’18.
  21. Nakagawa, H.; Iwasawa, Y.; Matsuo, Y. Graph-based Knowledge Tracing: Modeling Student Proficiency Using Graph Neural Network. In Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence, Thessaloniki, Greece, 14–17 October 2019; WI ’19, pp. 156–163.
  22. Yang, Y.; Shen, J.; Qu, Y.; Liu, Y.; Wang, K.; Zhu, Y.; Zhang, W.; Yu, Y. GIKT: A Graph-Based Interaction Model for Knowledge Tracing. In Proceedings of the Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2020, Ghent, Belgium, 14–18 September 2020; Proceedings, Part I; Springer: Berlin/Heidelberg, Germany, 2020; pp. 299–315.
  23. Miller, A.; Fisch, A.; Dodge, J.; Karimi, A.H.; Bordes, A.; Weston, J. Key-Value Memory Networks for Directly Reading Documents. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, TX, USA, 1–5 November 2016; pp. 1400–1409.
  24. Ghosh, A.; Heffernan, N.; Lan, A.S. Context-Aware Attentive Knowledge Tracing. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Virtual, 6–10 July 2020; pp. 2330–2339.
  25. Cheng, K.; Peng, L.; Wang, P.; Ye, J.; Sun, L.; Du, B. DyGKT: Dynamic Graph Learning for Knowledge Tracing. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Barcelona, Spain, 25–29 August 2024; KDD ’24, pp. 409–420.
  26. Fu, L.; Guan, H.; Du, K.; Lin, J.; Xia, W.; Zhang, W.; Tang, R.; Wang, Y.; Yu, Y. SINKT: A Structure-Aware Inductive Knowledge Tracing Model with Large Language Model. In Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, Boise, ID, USA, 21–25 October 2024; CIKM ’24, pp. 632–642.
  27. Pardos, Z.A.; Bhandari, S. Learning gain differences between ChatGPT and human tutor generated algebra hints. arXiv 2023, arXiv:2302.06871.
  28. Ding, X.; Larson, E.C. Automatic RNN Cell Design for Knowledge Tracing using Reinforcement Learning. In Proceedings of the Seventh ACM Conference on Learning @ Scale, Virtual, 12–14 August 2020; pp. 285–288.
  29. Choi, Y.; Lee, Y.; Shin, D.; Cho, J.; Park, S.; Lee, S.; Baek, J.; Bae, C.; Kim, B.; Heo, J. EdNet: A Large-Scale Hierarchical Dataset in Education. In Proceedings of the Artificial Intelligence in Education, Ifrane, Morocco, 6 July 2020; pp. 69–73.
  30. Howard, A.; bskim90; Lee, C.; Shin, D.M.; Jeon, H.P.T.; Baek, J.J.; Chang, K.; kiyoonkay; Heffernan, N.; seonwooko; et al. Riiid Answer Correctness Prediction. 2020. Available online: https://www.kaggle.com/competitions/riiid-test-answer-prediction (accessed on 8 October 2025).
  31. Wang, C.; Ma, W.; Zhang, M.; Lv, C.; Wan, F.; Lin, H.; Tang, T.; Liu, Y.; Ma, S. Temporal Cross-Effects in Knowledge Tracing. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining, Virtual, 8–12 March 2021; pp. 517–525.
  32. Ma, H.; Wang, J.; Zhu, H.; Xia, X.; Zhang, H.; Zhang, X.; Zhang, L. Reconciling Cognitive Modeling with Knowledge Forgetting: A Continuous Time-aware Neural Network Approach. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI-22, Vienna, Austria, 23–29 July 2022; pp. 2174–2181.
  33. Shin, D.; Shim, Y.; Yu, H.; Lee, S.; Kim, B.; Choi, Y. SAINT+: Integrating Temporal Features for EdNet Correctness Prediction. In Proceedings of the LAK21: 11th International Learning Analytics and Knowledge Conference, Irvine, CA, USA, 12–16 April 2021; pp. 490–496.
  34. Zhou, Y.; Lv, Z.; Zhang, S.; Chen, J. Disentangled knowledge tracing for alleviating cognitive bias. In Proceedings of the ACM on Web Conference 2025, Sydney, Australia, 28 April–2 May 2025; pp. 2633–2645.
  35. Zhou, H.; Rong, W.; Zhang, J.; Sun, Q.; Ouyang, Y.; Xiong, Z. AAKT: Enhancing Knowledge Tracing With Alternate Autoregressive Modeling. IEEE Trans. Learn. Technol. 2025, 18, 25–38.
  36. Liu, H.; Simonyan, K.; Yang, Y. DARTS: Differentiable Architecture Search. arXiv 2018, arXiv:1806.09055.
  37. Pham, H.; Guan, M.; Zoph, B.; Le, Q.; Dean, J. Efficient Neural Architecture Search via Parameters Sharing. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; Dy, J., Krause, A., Eds.; Proceedings of Machine Learning Research (PMLR): Brookline, MA, USA, 2018; Volume 80, pp. 4095–4104.
  38. Deb, K.; Pratap, A.; Agarwal, S.; Meyarivan, T. A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Trans. Evol. Comput. 2002, 6, 182–197.
  39. Liu, S.; Wu, D.; Sun, H.; Zhang, L. A Novel BeiDou Satellite Transmission Framework With Missing Package Imputation Applied to Smart Ships. IEEE Sens. J. 2022, 22, 13162–13176.
  40. Liu, S.; Wu, D.; Zhang, L. CGAN BeiDou Satellite Short-Message-Encryption Scheme Using Ship PVT. Remote Sens. 2023, 15, 171. [Google Scholar] [CrossRef]
  41. Nielsen, J. Usability Engineering; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 1994. [Google Scholar]
Figure 1. The architecture of the Adaptive Neural Architecture Search framework based on the Transformer, ANT-KT. The encoder processes input features to capture local and global dependencies, while the decoder receives the output of the encoder and generates predicted probabilities of knowledge states.
Figure 2. The proposed search space. In the right-hand figure, [*] denotes the index of the operation assigned to a module; for instance, [0] represents the Zero Operation. In the Global Operation, the numbers inside the rectangles indicate the number of convolution kernels.
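To make the index-based encoding in Figure 2 concrete, the following minimal Python sketch maps operation indices to module types and decodes a candidate cell. Only the [0] = Zero mapping is stated in the caption; the remaining indices, the operation names (taken from the module types listed in Figure 4), and the example candidate are illustrative placeholders rather than the exact search space used in the paper.

```python
# Illustrative index-to-operation table for the search space sketched in Figure 2.
# Only index 0 (Zero Operation) is fixed by the caption; the other entries are
# hypothetical placeholders based on the module types named in Figure 4.
OP_TABLE = {
    0: "Zero",    # null operation, i.e., skip the module
    1: "MHSA",    # multi-head self-attention
    2: "FFN",     # feed-forward network
    3: "Conv-3",  # 1-D convolution, kernel size 3
    4: "Conv-5",
    5: "Conv-7",
}

# A candidate cell is then simply a sequence of operation indices.
candidate = [1, 3, 0, 2, 4, 2]
print([OP_TABLE[i] for i in candidate])
# -> ['MHSA', 'Conv-3', 'Zero', 'FFN', 'Conv-5', 'FFN']
```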
Figure 3. Overview of the proposed evolutionary search algorithm. The algorithm starts with population initialization through random selection, followed by mating pool selection. During offspring generation, crossover and mutation operations are performed on parent pairs. The key innovation lies in the individual evaluation and search space reduction stages, where a three-dimensional fitness vector (AUC, ACC, Latency) is constructed and used to guide the search process. This approach enables more efficient exploration of the search space by considering multiple performance metrics simultaneously.
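The search loop summarized in Figure 3 can be sketched as a small multi-objective evolutionary algorithm. The Python sketch below is a simplified illustration under assumed settings: evaluate() returns random placeholder scores rather than supernet-based AUC/ACC and measured latency, and the search-space reduction rule (dropping operations that never appear in non-dominated individuals) is one plausible instantiation, not the authors' exact implementation.

```python
import random

OPS = ["zero", "mhsa", "ffn", "conv3", "conv5", "conv7"]     # illustrative vocabulary
NUM_SLOTS, POP_SIZE, GENERATIONS, MUT_RATE = 8, 20, 10, 0.2  # assumed settings

def random_arch(ops):
    return [random.choice(ops) for _ in range(NUM_SLOTS)]

def evaluate(arch):
    """Placeholder fitness vector (AUC, ACC, latency).
    In the real framework, AUC/ACC would come from evaluating the candidate with
    inherited supernet weights, and latency from on-device measurement."""
    auc = random.uniform(0.75, 0.84)
    acc = random.uniform(0.70, 0.76)
    latency = 1.0 + sum(op != "zero" for op in arch) / NUM_SLOTS
    return auc, acc, latency

def dominates(a, b):
    """Pareto dominance: AUC and ACC are maximized, latency is minimized."""
    no_worse = a[0] >= b[0] and a[1] >= b[1] and a[2] <= b[2]
    strictly_better = a[0] > b[0] or a[1] > b[1] or a[2] < b[2]
    return no_worse and strictly_better

def crossover(p1, p2):
    cut = random.randrange(1, NUM_SLOTS)
    return p1[:cut] + p2[cut:]

def mutate(arch, ops):
    return [random.choice(ops) if random.random() < MUT_RATE else op for op in arch]

ops = list(OPS)
population = [random_arch(ops) for _ in range(POP_SIZE)]
for _ in range(GENERATIONS):
    scored = [(arch, evaluate(arch)) for arch in population]
    # Mating pool: keep the non-dominated individuals (Pareto front).
    parents = [a for a, f in scored if not any(dominates(g, f) for _, g in scored)]
    if len(parents) < 2:                      # degenerate-case guard
        parents = [a for a, _ in scored[:2]]
    # Dynamic search-space reduction (illustrative rule): discard operations
    # that no non-dominated architecture uses.
    used = {op for arch in parents for op in arch}
    ops = [op for op in ops if op in used] or ops
    # Offspring generation: crossover of parent pairs followed by mutation.
    population = parents + [mutate(crossover(*random.sample(parents, 2)), ops)
                            for _ in range(POP_SIZE - len(parents))]

print("example surviving architecture:", population[0])
```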
Figure 4. Comparison of the searched architectures. (a) Optimal architecture discovered by ANT-KT; (b) architecture produced by the original (baseline) evolutionary algorithm. Symbols: orange rectangle = MHSA (multi-head self-attention); blue parallelogram = FFN (feed-forward network); green trapezoid = Conv-k (1-D convolution with kernel size k); gray circle = Zero (null operation, i.e., skip connection). Layout: each vertical stack represents one encoder/decoder layer, and the top-to-bottom order reflects the arrangement of the selected modules within that layer. Color intensity: darker shades indicate modules that are selected repeatedly; lighter shades mark modules used only once in the final cell.
Table 1. Statistics of the two largest education datasets, EdNet and RAIEd2020.
Datasets  | # of Interactions (# of Tags) | # of Students (# of Tag-Sets) | # of Exercises (# of Bundles) | # of Skills (# of Explanations)
EdNet     | 95,293,926 (302)              | 84,309 (1792)                 | 13,169 (9534)                 | 7 (-)
RAIEd2020 | 99,271,300 (189)              | 393,656 (1520)                | 13,523 (-)                    | 7 (2)
Table 2. Comparison of various models on the EdNet and RAIEd2020 datasets. Best results are highlighted in bold, and the second-best results are underlined.
Model     | EdNet RMSE | EdNet ACC | EdNet AUC | RAIEd2020 RMSE | RAIEd2020 ACC | RAIEd2020 AUC
DKT       | 0.4653     | 0.6537    | 0.6952    | 0.4632         | 0.6622        | 0.7108
HawkesKT  | 0.4475     | 0.6888    | 0.7487    | 0.4453         | 0.6928        | 0.7525
CT-NCM    | 0.4364     | 0.7063    | 0.7743    | 0.4355         | 0.7079        | 0.7771
SAKT      | 0.4405     | 0.6998    | 0.7680    | 0.4381         | 0.7035        | 0.7693
AKT       | 0.4399     | 0.7016    | 0.7686    | 0.4368         | 0.7076        | 0.7752
SAINT     | 0.4322     | 0.7132    | 0.7825    | 0.4310         | 0.7143        | 0.7862
SAINT+    | 0.4285     | 0.7188    | 0.7916    | 0.4272         | 0.7192        | 0.7934
NAS-Cell  | 0.4345     | 0.7143    | 0.7796    | 0.4309         | 0.7167        | 0.7839
DisKT     | 0.4592     | 0.6863    | 0.7384    | -              | -             | -
ENAS-KT   | 0.4209     | 0.7295    | 0.8062    | 0.4196         | 0.7313        | 0.8089
AAKT      | 0.4064     | 0.7554    | 0.7827    | -              | -             | -
ANT-KT    | 0.4062     | 0.7553    | 0.8387    | 0.4122         | 0.7438        | 0.8239
Improve   | −1.47%     | 0%        | 3.25%     | −0.74%         | 1.25%         | 1.50%
Table 3. Results of the ablation studies.
Dataset   | Metric | Baseline | Encoder | Decoder | ANT-KT
EdNet     | RMSE   | 0.4209   | 0.4246  | 0.4200  | 0.4062
EdNet     | ACC    | 0.7295   | 0.7243  | 0.7424  | 0.7553
EdNet     | AUC    | 0.8062   | 0.7993  | 0.8208  | 0.8387
RAIEd2020 | RMSE   | 0.4196   | 0.4275  | 0.4147  | 0.4122
RAIEd2020 | ACC    | 0.7313   | 0.7198  | 0.7408  | 0.7438
RAIEd2020 | AUC    | 0.8089   | 0.7939  | 0.8201  | 0.8239
Table 4. Comparison of training time and performance across different models and ANT-KT variants.
Model                        | Supernet Training Time (h) | Evolutionary Search Time (h) | Final Architecture Training Time (h) | AUC    | ACC
SAINT+                       | N/A                        | N/A                          | 2.381                                | 0.7934 | 0.7192
ENAS-KT                      | 41.863                     | 12.8                         | 8.784                                | 0.8089 | 0.7313
ANT-KT (Encoder)             | 41.438                     | 8.65                         | 6.066                                | 0.7993 | 0.7243
ANT-KT (Decoder)             | 39.767                     | 8.921                        | 17.85                                | 0.8208 | 0.7424
ANT-KT                       | 39.288                     | 8.81                         | 5.39                                 | 0.8387 | 0.7553
ANT-KT (Search optimization) | 39.288                     | 7.42                         | 4.357                                | 0.8228 | 0.7439
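As a quick sanity check on Table 4, the end-to-end NAS cost of each pipeline can be obtained by summing its three stage times. The short snippet below does exactly that; the totals and percentage savings are derived here from the reported figures and are not stated explicitly by the authors.

```python
# Stage times (hours) copied from Table 4: supernet training, evolutionary
# search, and final architecture training.
totals = {
    "ENAS-KT":                      41.863 + 12.8 + 8.784,
    "ANT-KT":                       39.288 + 8.81 + 5.39,
    "ANT-KT (Search optimization)": 39.288 + 7.42 + 4.357,
}
baseline = totals["ENAS-KT"]
for name, hours in totals.items():
    saving = 100 * (baseline - hours) / baseline
    print(f"{name}: {hours:.1f} h total ({saving:.1f}% less than ENAS-KT)")
# ENAS-KT: 63.4 h total (0.0% less than ENAS-KT)
# ANT-KT: 53.5 h total (15.7% less than ENAS-KT)
# ANT-KT (Search optimization): 51.1 h total (19.5% less than ENAS-KT)
```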