DCKT: A Novel Dual-Centric Learning Model for Knowledge Tracing

Abstract: Knowledge tracing (KT), which aims to model learners' mastery of a concept based on their historical learning records, has received extensive attention due to its great potential for realizing personalized learning in intelligent tutoring systems. However, most existing KT methods focus on a single aspect, either knowledge or learner, and pay little attention to the coupling influence of knowledge and learner characteristics. To fill this gap, we explore a new paradigm for the KT task that exploits this coupling influence. A novel model called Dual-Centric Knowledge Tracing (DCKT) is proposed to model knowledge states through two joint tasks of knowledge modeling and learner modeling. In particular, we first generate concept embeddings rich in knowledge structure information via a pretext task (knowledge-centric): unsupervised graph representation learning. Then, we measure learners' prior knowledge using the knowledge-enhanced representations and three predefined educational priors for discriminative feature enhancement. Furthermore, we design a forgetting-fusion transformer (learner-centric) to simulate the declining trend of learners' knowledge proficiency over time, representing the common forgetting phenomenon. Extensive experiments were conducted on four public datasets, and the results demonstrate that DCKT achieves better knowledge tracing results on all datasets via a dual-centric modeling process. Additionally, DCKT can learn meaningful question embeddings automatically without manual annotations. Our work indicates a potential future research direction for personalized learner modeling that offers both accuracy and high interpretability.


Introduction
The past few decades have witnessed the rapid development of online education platforms, such as massive open online courses (MOOCs) and intelligent tutoring systems (ITS), which improve learning efficiency while minimizing the cost of education [1]. Knowledge tracing (KT) [2] is an essential task in online education platforms. Given learners' past learning records, it aims to track and quantify their knowledge state over time to make accurate predictions on future performance. Concretely, given a set of discrete time indices $t$, we use the following generic model to represent a learner's hidden knowledge state and historical performance:

$$h_t = f(h_{t-1}), \qquad r_t = g(h_t),$$

where the hidden variable $h_t$ denotes the learner's knowledge state at time step $t$, and the binary value $r_t \in \{0, 1\}$ denotes the learner's predicted response to the current question (with 1 representing a correct answer and 0 representing an incorrect answer). $f(\cdot)$ and $g(\cdot)$ are the two functions that characterize the learners' knowledge evolution and predict their future responses, respectively. Once the knowledge proficiency is precisely estimated through KT, learners can make up for their weaknesses in time and thus maximize the learning outcome. Due to its great potential for personalized learning, KT attracts increasing interest and is widely used in the scientific and educational communities [3][4][5].
Research efforts on KT tasks usually focus on a single aspect, either knowledge or learners, for different purposes, and Table 1 briefly summarizes these models from the two aspects. On the one side, classical KT models, such as Bayesian knowledge tracing (BKT) [2], deep knowledge tracing (DKT) [6], and DKT-forget [7], concentrate on mining learners' interaction information and estimating hidden knowledge states from their learning performance data, and pay less attention to knowledge estimation. On the other side, some models pay more attention to knowledge modeling. For example, prerequisite-driven deep knowledge tracing (PDKT-C) [8], structure-based knowledge tracing (SKT) [9], PQR-LKA [10], and AKT [4] highlight the importance of knowledge structures or the need to learn embedding representations with plentiful domain knowledge, but assess learners' knowledge in simple ways, such as with a simple RNN. However, knowledge and learner characteristics have a combined effect during the process of learners' cognition and knowledge growth; ignoring either the knowledge factor or the learner factor decreases the prediction accuracy of knowledge tracing, and only combining both factors can yield more accurate predictions (we show this in the experiment section).

[Table 1: comparison of existing KT models, including DKT [6], DKT-forget [7], PDKT-C [8], SKT [9], PQR-LKA [10], AKT [4], and DCKT (ours), from the knowledge and learner aspects.]

From the perspective of learner modeling [3,11] in education theory, knowledge tracing is a typical learner modeling technique whose process involves human knowledge and learning. For better illustration, we give an example of knowledge tracing in Figure 1. The middle part depicts the learning process, where a learner practices a sequence of questions $\{q_1, q_2, \ldots, q_8\}$ associated with a concept set $\{c_1, c_2, c_3, c_4\}$, and after this, the learner is informed whether their answers are correct or not. In addition to these observable phenomena, there are implicit but non-negligible details in a KT task. As shown in the upper dotted box, the learner's knowledge state of all concepts constantly changes during learning, and the whole process reflects their cognitive evolution. Moreover, the learner's knowledge proficiency of a specific concept shows a declining trend since their last practice, which is attributed to human memory decay in cognitive science, known as forgetting behaviors. The bottom dotted box in Figure 1 shows the latent knowledge structure, where the colored undirected lines represent association relations and the black directed lines represent prerequisite relations. We can observe that the knowledge components are linked by multiple relations, including association relations between questions and concepts and prerequisite relations within concepts.
Therefore, it is necessary to give equal importance to the characteristics of knowledge and learners and to integrate the process of knowledge modeling and learner modeling effectively for knowledge tracing. Although both aspects are somewhat involved, previous deep-learning-based KT methods cannot meet this requirement due to three major challenges. First, the knowledge structure, which is inextricably linked to modeling domain knowledge, is inherent and implicit in the KT scenario. For instance, each question may relate to multiple concepts, and different concepts also have potential correlations, thus making it difficult to learn the complex relational dependencies between these knowledge components. Yang et al. [12] propose a Graph-based Interaction model for Knowledge Tracing (GIKT) to learn the graph embeddings of questions and skills from high-order relations. However, the defined relationships between concepts and questions rely on many expert annotations. Second, prior knowledge is the basis of learners' differentiated knowledge proficiency and an important criterion for evaluating personalized learning, which needs to be measured based on learning performance data. Third, knowledge decline is an inevitable phenomenon in the learning process, commonly attributed to forgetting behaviors. Hence, it is also a huge challenge to be solved in KT tasks.

To address the above challenges, we propose a novel KT framework called Dual-Centric Knowledge Tracing (DCKT) to integrate the two subtasks of knowledge discovery and knowledge tracing. The purpose of the former is to serve the latter. Specifically, since knowledge structure is implicit and static, we exploit knowledge structure features as crucial domain information to enhance the KT task. Following this idea, concept embeddings are generated via a well-designed pretext task, which constructs the knowledge structure through an unsupervised representation learning method without needing manual labels. In particular, we compute a transition probability matrix from the large-scale learning logs based on specific statistics. A concept prerequisite graph is constructed with the matrix, and high-order relations between concepts in the concept prerequisite graph are learned with graph neural networks (GNNs). Notably, skill-level KT datasets identify each question by its underlying concept based on the Q-matrix, combined with graph representation learning for prerequisite relationships within concepts. Thus, the produced concept embeddings contain a wealth of knowledge structure information. Then, to measure learners' prior knowledge, we generate knowledge-enhanced embedding representations and represent learners' knowledge proficiency using three predefined educational priors to enhance the discriminative features. Therefore, we are more capable of capturing learners' personalized traits at a finer-grained level from long-term behaviors. Finally, for modeling forgetting behaviors over a long study period, we design a forgetting-fusion transformer to determine the rate of learners' knowledge decline over time.
To sum up, the main contributions of this work are as follows:

• We propose a novel KT model, namely DCKT, which combines the task of knowledge discovery with knowledge tracing and leverages the former to benefit the latter, i.e., a knowledge-centric module, called concept graph representation learning, and a learner-centric module, called KT, with fine-grained forgetting behaviors modeling.

• We explore an unsupervised representation learning method that automatically infers domain prerequisites and learns graph representations for concepts, which can be leveraged to enhance knowledge tracing.

• We design a novel forgetting-fusion transformer that models the forgetting behaviors of learners with exponential decay attention to quantify the forgetting effect during learning.

• We conduct extensive experiments to evaluate the performance of our proposed DCKT model on four public KT datasets. The results demonstrate the effectiveness of DCKT in concept prerequisite inferences and knowledge tracing.
The remainder of this paper is organized as follows. Section 2 reviews the related work. Section 3 lists important problem definitions and notations in the research. Section 4 introduces our proposed KT model. Section 5 details the four research questions and the experiment settings. Section 6 presents the experiment results and discusses these research questions. Finally, the paper concludes in Section 7.

Knowledge Tracing
Generally speaking, existing KT work can be divided into two categories: traditional statistics-based models and deep-learning-based models. The first class of traditional KT models is Bayesian knowledge tracing (BKT) [2], which uses a probabilistic graphical model such as a hidden Markov model (HMM) to track the latent knowledge state. BKT-based methods assume that the current knowledge state is determined by the state at the previous time step. They model the knowledge state as a set of binary latent variables based on mastery learning [13]. The second traditional KT category comprises factor analysis models with logistic regression, such as the Additive Factor Model (AFM) [14], Performance Factor Analysis (PFA) [15], and Knowledge Tracing Machine (KTM) [16]. The key idea of these models is to predict the response performance by learning a logistic function, which considers a wide range of factors such as learner, concept, item, or learning environment. Although both classes of statistical KT models have good interpretability, their reliance on manual tags and oversimplified assumptions prevent them from mining the complex knowledge state of learners.
With the huge breakthrough of deep learning in various fields, Piech et al. [6] introduced deep learning techniques into KT for the first time. Deep Knowledge Tracing (DKT) employs recurrent neural networks (RNNs) or their variant Long Short-Term Memory (LSTM) on learning interaction sequences and models the knowledge state as a high-dimensional hidden state at each time step, showing great potential for learning performance prediction. Dynamic Key-Value Memory Networks (DKVMN) [17] use a memory network to enrich the hidden variable representations in the KT task. They design two matrices for tracking knowledge state over time, with a static matrix called key to store the latent concepts underlying all questions and a dynamic matrix called value to store and update the mastery level of each concept through reading and writing operations. To model complex learning behaviors in real-world education scenarios, many variants focus on integrating rich features as side information for KT. For the personalized modeling of learners, Deep Knowledge Tracing for Dynamic Student Classification (DKT-DSC) [18] extends DKT by clustering learners with similar ability levels using the K-means clustering algorithm and incorporating the clustering results into the model input. In addition to these student-specific factors, some work explores the inclusion of knowledge characteristics to enhance the KT task. For example, PDKT-C [8] leverages the prerequisite relations between latent concepts as additional constraints, and EERNN [19], EKT [20], and MathBERT [21] incorporate textual features of questions as additional input to the KT model.
Various emerging techniques have recently been applied to tackle the KT problem. Inspired by the powerful capabilities of Transformer [22] in time series analysis, Self-Attentive Knowledge Tracing (SAKT) [23] introduced an attention mechanism into KT for the first time. Later, Context-aware Attentive Knowledge Tracing (AKT) [4] modified the original scaled dot-product attention and proposed monotonic attention to learn context-aware representations. It computes attention weights for questions by simulating the forgetting effect as a time distance measure. However, although the common idea of attention-based KT models is to learn attention weights of key knowledge components, most works ignore the influence of learners' personalization characteristics on their learning. In this regard, Convolutional Knowledge Tracing (CKT) [24] leverages convolutional neural networks (CNNs) to model the individualization of learners based on their individualized prior knowledge and learning rates. Considering the various graph structures that naturally exist in KT, graph neural networks (GNNs) have been used to process these graph-structured data and mine relational structures for better embedding representations, as in the GKT [25], GIKT [12], SGKT [26], and Bi-CLKT [27] models. While deep-learning-based knowledge tracing has shown promising performance, limited work explicitly defines the KT task from knowledge and learner perspectives and emphasizes the combined role of both in the modeling process.

Concept Prerequisite Inferences in KT
Due to the fundamental role that concepts play in human cognitive processes [28], the inference of concept prerequisites has been studied in various educational contexts. For example, Wang et al. [29] leverage prerequisites to construct concept maps from textbooks. Pan et al. [30] design a representation-learning-based method that leverages different features to infer the prerequisite relations between course concepts in MOOCs. To alleviate the reliance on manually labeled course prerequisites, Roy et al. [31] propose a new supervised learning method capable of identifying unknown concept prerequisites with labeled concept prerequisite data and course prerequisites.
For knowledge modeling, many deep learning KT models attempt to infer concept prerequisite relationships automatically. Chen et al. [32] propose a novel algorithm named COMMAND to simultaneously learn a concept prerequisite graph and a student model from performance data, which models the concept prerequisite relations as a Bayesian network through a two-stage learning process. To address the data sparsity issue, PDKT-C [8] advocates for incorporating knowledge structure information into the KT model, especially the prerequisite relations between pedagogical concepts. It first models prerequisites as ordered pairs, then combines them with a proper mathematical formulation to serve as model constraints. Inspired by the success of GNNs in relation learning, GKT [25] utilizes the graph-structured nature of knowledge as a relational inductive bias and reformulates the KT task as a time series node-level classification problem in GNNs. This work proposed statistics-based and learning-based approaches to construct latent knowledge graphs, where nodes represent concepts and edges represent the dependency relations between concepts, such as similarity and prerequisite relations. Unlike the graph data in GKT, which only involve a single relation between concepts, SKT [9] captures multiple relations between concepts and learns graph embeddings through information propagation. An increasing number of KT models extract knowledge structures to enrich embedding representations, but limited work considers the static nature of knowledge and uses domain knowledge as an important supplement to dynamic KT tasks.

Forgetting Behaviors in KT
In cognitive psychology studies [33,34], there is broad evidence showing that forgetting behaviors significantly impact learners' knowledge proficiency and post-learning performance. Moreover, the well-known Ebbinghaus forgetting curve theory [35] shows that learners tend to forget what they have learned at an exponentially decaying rate. Therefore, forgetting modeling is highly active in many KT models. Nedungadi and Remya [36] extended BKT by incorporating forgetting behaviors into their model, which is viewed as knowledge decline over time and measured by an exponentially decaying function. To characterize more complex forgetting behaviors in the entire sequence, DKT-forget [7] adds three types of forgetting features that reflect both the learning and forgetting effects to the DKT model. Similar to the idea of DKT-forget, a probabilistic matrix factorization model called Knowledge Proficiency Tracing (KPT) [37] captures knowledge-state dynamics over time based on the forgetting curve and learning curve theories. A recent attempt at a forgetting-aware KT is the Deep Graph Memory Network (DGMN) [38] model, which utilizes GNNs to learn forgetting behavior dynamically. DGMN differs from previous models in that a dynamic graph is built for identifying mutual relationships among concepts to model forgetting behaviors over the latent concept space. Though the above models attach great importance to the phenomenon of forgetting, they ignore the dynamic influence of knowledge decline and proficiency level change on human memory retention during learning, which limits the ability to capture nonmonotonic forgetting behaviors.

Problem Definition
An online education platform encompasses a set of learners $L$ and a wide range of knowledge components, including a set of questions $Q = \{q_1, q_2, \cdots, q_N\}$ and a set of concepts $C = \{c_1, c_2, \cdots, c_M\}$. In a KT task, the learning process is typically viewed as a composition of interactions between learners and knowledge components across consecutive time steps, which is explicitly reflected by learners' question-answering records. Along this line, knowledge tracing can be reasonably formulated as a sequence prediction problem. We denote a learner's learning sequence with $t$ time steps as $X_t = (x_1, x_2, \ldots, x_t)$, where $x_i = (q_i, r_i)$. Here, $q_i \in Q$ refers to the question answered at time step $i$, and $r_i \in \{0, 1\}$ indicates whether the question $q_i$ has been answered correctly, with 0 representing wrong and 1 representing correct. Important definitions are given as follows:

Definition 1 (Q-matrix). The Q-matrix is a binary matrix of size $N \times M$ that describes correlations between all the questions $Q$ and concepts $C$, which is typically predefined by domain experts. If question $q_i$ is related to the concept $c_j$, then the entry at position $(i, j)$ is 1, and 0 otherwise.

Definition 2 (Concept prerequisite graph). The concept prerequisite graph is denoted as $G = (V, E, X)$, where $V = C$ is the set of concept nodes. These concepts share prerequisite dependencies denoted as $E \subseteq V \times V$; $X \in \mathbb{R}^{M \times D}$ represents the node feature matrix, and $D$ is the feature dimension. The topology of the graph is defined by the adjacency matrix $A \in \mathbb{R}^{M \times M}$, where $A_{c_i, c_j} = 1$ means concept $c_i \in C$ is the prerequisite of concept $c_j \in C$, and $A_{c_i, c_j} = 0$ otherwise.

Definition 3 (Knowledge Tracing). Given a learner's learning sequence $X_t = (x_1, x_2, \ldots, x_t)$ and the next question $q_{t+1}$, the objective of the KT task is to assess the learner's evolving knowledge state over time and predict the probability of $q_{t+1}$ being answered correctly at the time step $t + 1$.
Like traditional skill-level KT methods, this work denotes every question by its underlying concept through a question-to-concept mapping. A list of important notations used in DCKT is presented in Table 2.

Table 2. Important notations used in DCKT.
Variable : Description
$x_t$ : the learning interaction at time $t$
$\mathbf{c}_i$ : an embedding representation of concept $c_i$
$\mathbf{q}_t$ : an embedding representation of question $q_t$
$\mathbf{x}_t$ : an embedding representation of learning interaction $x_t$
$r_t$ : learner's ground-truth response to the question $q_t$
$\hat{r}_t$ : the model-predicted learner's response to the question $q_t$
$h_t$ : learner's hidden knowledge state at time $t$
$E_K$ : an embedding matrix of learner's personalized prior

Predefined Embeddings
To realize the main goal of knowledge tracing, we consider the following input elements: concepts, questions, answers, and interactions. In DCKT, all embeddings are associated with the concept embeddings, which are randomly initialized as $E_C \in \mathbb{R}^{M \times D}$, where $D$ represents the embedding dimension. After concept graph representation learning, the trained concept embeddings are mapped to the question embedding matrix $E_Q \in \mathbb{R}^{t \times D}$. For easy calculation and unified representation, we convert the response $r_i$ to a zero vector $\mathbf{0} \in \mathbb{R}^{D}$ with the same dimension $D$. The correctness of a learner's responses greatly affects the knowledge state assessment, so we distinguish between wrong and correct response representations. The learning interaction representation $x_i \in \mathbb{R}^{2D}$ is defined as:

$$x_i = \begin{cases} q_i \oplus \mathbf{0}, & r_i = 1 \\ \mathbf{0} \oplus q_i, & r_i = 0 \end{cases}$$

where $\oplus$ denotes the concatenation operation. We represent the embedding matrix of learning interactions (LI) as $\mathrm{LI} \in \mathbb{R}^{t \times 2D}$.
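As a minimal sketch of the interaction representation described above, the following function concatenates a question embedding with a zero vector of the same length. The function name `interaction_embedding` and the ordering convention (question embedding first for a correct response, zero vector first for an incorrect one) are assumptions made for illustration; the text only states that the two cases are distinguished by concatenation.

```python
def interaction_embedding(q_emb, response):
    """Build a 2D-dimensional interaction vector from a D-dimensional question embedding.

    Assumed convention (hypothetical): correct -> [q ; 0], incorrect -> [0 ; q].
    """
    zeros = [0.0] * len(q_emb)
    if response == 1:
        return list(q_emb) + zeros
    return zeros + list(q_emb)


q = [0.2, -0.5, 0.1]
print(interaction_embedding(q, 1))  # [0.2, -0.5, 0.1, 0.0, 0.0, 0.0]
print(interaction_embedding(q, 0))  # [0.0, 0.0, 0.0, 0.2, -0.5, 0.1]
```

Placing the embedding in different halves keeps correct and incorrect interactions with the same question orthogonal to each other, which is one simple way to make the two cases separable for downstream layers.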

The DCKT Model
This section introduces our proposed DCKT in detail, which consists of two modules: Unsupervised Graph Representation Learning (knowledge-centric module) and KT with Fine-grained Forgetting Behaviors Modeling (learner-centric module). Figure 2 shows the model architecture of DCKT.

Unsupervised Graph Representation Learning
This module aims to discover the latent knowledge graph structure and incorporate concept prerequisites as domain knowledge in preparation for the subsequent tasks.

Knowledge Structure Construction
In a KT task, learning sequences consist of knowledge components, e.g., concepts and questions, that are regularly organized following some inherent rules. For example, questions typically evolve from relatively elementary ones to advanced ones. Only when learners master the underlying prerequisite concepts do they have the knowledge base to master subsequent ones. Thus, fully exploring the prerequisite concept structure is crucial for modeling learners' knowledge. However, existing prerequisite inference methods suffer from heavy reliance on expert labels and from the data sparsity problem. Inspired by unsupervised learning, which digs out informative knowledge from the data themselves without relying on manual annotation, we aim to construct the underlying knowledge graph automatically via an unsupervised learning approach based on domain-related statistics.
Considering that the learning sequence order explicitly reflects concept prerequisites, we learn the latent knowledge structure from the given Q-matrix and large-scale learning sequences in a data-driven manner. Question representations constructed from the Q-matrix and concept map integrate question-distinctive information and the dependency relationships between questions and concepts, but ignore the inner correlations between latent concepts. Inspired by previous work [25], we first mine the implicit knowledge structure from the massive training datasets, thereby representing the concept relations as a transition probability matrix $T \in \mathbb{R}^{M \times M}$:

$$T_{i,j} = \begin{cases} \dfrac{n_{i,j}}{\sum_{k=1}^{M} n_{i,k}}, & i \neq j \\ 0, & \text{otherwise} \end{cases}$$

where $n_{i,j}$ counts the total number of unidirectional occurrences from concept $c_i$ to concept $c_j$. Then, we define the adjacency matrix of the concept prerequisite relations by $A_{i,j} = 1$ if $T_{i,j} \neq 0$, and $A_{i,j} = 0$ otherwise.
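The construction above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the authors' implementation: the function name `transition_and_adjacency` is hypothetical, concepts are assumed to be integer indices, and an "occurrence from $c_i$ to $c_j$" is interpreted as $c_j$ being practiced immediately after $c_i$ in a learner's sequence.

```python
def transition_and_adjacency(sequences, num_concepts):
    """Estimate a transition probability matrix T and adjacency matrix A.

    sequences: list of concept-index sequences, one per learner.
    Assumption: n[i][j] counts how often concept j directly follows concept i.
    """
    M = num_concepts
    counts = [[0] * M for _ in range(M)]
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1

    T = [[0.0] * M for _ in range(M)]
    A = [[0] * M for _ in range(M)]
    for i in range(M):
        row_total = sum(counts[i])
        for j in range(M):
            # The diagonal is forced to zero; rows with no outgoing transitions stay zero.
            if i != j and row_total > 0:
                T[i][j] = counts[i][j] / row_total
            A[i][j] = 1 if T[i][j] != 0 else 0
    return T, A


logs = [[0, 1, 2], [0, 1, 1, 2], [0, 2]]
T, A = transition_and_adjacency(logs, 3)
print(A)  # [[0, 1, 1], [0, 0, 1], [0, 0, 0]]
```

In this toy log, concept 0 always precedes concepts 1 and 2, so the resulting graph has edges 0→1, 0→2, and 1→2. On real data, a nonzero test admits every transition that was observed at least once, so a small threshold on $T_{i,j}$ may be worth considering to suppress noisy edges.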

Concept Graph Representation
From the viewpoint of data structure, knowledge concepts have a potential graph-structured nature that is worthy of further exploration. After representing the concept prerequisite structure with the matrix $A$, we construct the global prerequisite graph of all concepts as $G = (C, E, X)$, where the feature matrix $X$ is randomly initialized for the distinct concepts. To preserve the directions of prerequisite relations and extract high-order information in the graph, we leverage the graph neural network with edge multilayer perceptron (GNN-MLP) [39] to encode concept embeddings. It aggregates and propagates each message by applying an MLP to the concatenation of the source and target states. Node representations are updated with the current concept node $c_i$ and its neighboring node representations $N_i$ using the following definition:

$$c_i^{(\ell+1)} = \sum_{j \in N_i} \mathrm{MLP}^{(\ell)}\big(c_j^{(\ell)} \oplus c_i^{(\ell)}\big)$$

where $\mathrm{MLP}^{(\ell)}(\cdot)$ is the message passing function of the $\ell$-th MLP layer, and $\oplus$ represents the concatenation operation. After graph learning, the concept embedding representations at time step $t$ are updated as $c_t$.
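To make the message-passing step concrete, here is a small, dependency-free sketch of one layer. Several details are assumptions chosen for brevity rather than taken from the paper or from [39]: the "MLP" is a single linear map followed by ReLU, messages are summed over incoming prerequisite edges, and a node with no prerequisites keeps its current embedding. The names `gnn_mlp_layer`, `linear`, and `relu` are hypothetical.

```python
def relu(vec):
    return [max(0.0, v) for v in vec]


def linear(W, b, x):
    """Apply y = W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def gnn_mlp_layer(emb, adj, W, b):
    """One message-passing layer over a directed concept graph.

    emb: M x D concept embeddings; adj[j][i] == 1 means c_j is a prerequisite of c_i.
    W: D x 2D weights and b: D biases of the (assumed single-layer) message MLP.
    """
    M = len(emb)
    updated = []
    for i in range(M):
        sources = [j for j in range(M) if adj[j][i] == 1]
        if not sources:
            # Assumption: nodes without prerequisites are left unchanged.
            updated.append(list(emb[i]))
            continue
        agg = [0.0] * len(b)
        for j in sources:
            # Message computed from the concatenation of source and target states.
            msg = relu(linear(W, b, emb[j] + emb[i]))
            agg = [a + m for a, m in zip(agg, msg)]
        updated.append(agg)
    return updated


emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
adj = [[0, 1, 1], [0, 0, 1], [0, 0, 0]]
W = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
b = [0.0, 0.0]
print(gnn_mlp_layer(emb, adj, W, b))  # [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
```

Because messages flow only along `adj[j][i]`, the direction of each prerequisite edge is preserved; stacking several such layers lets a concept receive information from prerequisites that are more than one hop away.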

KT with Fine-Grained Forgetting Behaviors Modeling
To realize the primary goal of predicting learning performance and establishing personalized profiles of learners, we design a hierarchical framework to implement the downstream task of knowledge tracing.

Knowledge-Enhanced Representations
We replace each question embedding with its corresponding concept embedding from the trained concept embedding matrix. Thus, the question embedding $q_t$ at time step $t$ is represented as its underlying concept embedding $c_t$. Likewise, we update the response embedding $r_t$ and interaction embedding $x_t$ using the trained concept embedding $c_t$. In this way, concept prerequisite information can be incorporated into the model input, but deep-level contextual dependencies between question embeddings are still unexplored. Inspired by Transformer's excellent performance in parallelization and representation learning [22], we use a modified version called the forgetting-fusion transformer for long-range relation learning, which is introduced in detail in the following section. We employ the forgetting-fusion transformer on the past question embeddings to further enhance global dependency learning between these prerequisite-enhanced embeddings. Specifically, the global-aware question representation $\hat{q}_i \in \mathbb{R}^{D}$ is constructed by packing all the question embeddings $\{q_1, \ldots, q_i\}$ together into matrices $Q$, $K$, and $V$:

$$\hat{q}_i = f_{\mathrm{ForgetAtt}}(Q, K, V;\ \theta_1)$$

where $f_{\mathrm{ForgetAtt}}(\cdot)$ is the attention function of our forgetting-fusion transformer, and $\theta_1$ is a trainable global scalar initialized randomly and learned automatically during the training process.

Fine-Grained Prior Refinement
After extracting the complex global dependencies among questions, a second problem arises: what we have learned remains general information, which may be insufficient for distinguishing the knowledge mastery levels of learners. Moreover, the large receptive field of a transformer may lead it to fit some irrelevant features while ignoring highly discriminative features that could have a more significant impact on the prediction results. To deal with these potential issues, we augment the impact of learners' personalized characteristics by concatenating predefined educational priors from three aspects: Attempt Times (AT), Long-range Performance (LP), and Learning Interactions (LI). Despite its simplicity, our experiment results show its great potential for personalization and interpretability in the KT task.
Attempt Times (AT): A learner's proficiency level on the current question is strongly associated with their historical attempts related to concepts. Accordingly, we use AT to count the total number of times each learner answers a question relating to a specific concept, which is defined as follows:

$$\mathrm{AT}_m = \mathrm{count}(q_m), \quad m \in \{1, \ldots, M\}$$

where $m$ refers to the concept underlying the current question, and $\mathrm{count}(q_m)$ represents the total number of times the learner answered question $q_m$.

Long-range Performance (LP): Learning performance is commonly treated as roughly equivalent to historical interactions. In fact, the implicit connections between past questions and interactions have a non-negligible impact on learners' behavioral performance. On the one hand, learning performance is strongly associated with question similarity. For example, a learner tends to achieve similar performance on questions related to the same concept. On the other hand, how learners interact with past questions greatly affects their performance on the current question, because historical interactions reflect the evolution of knowledge proficiency. Based on the two key factors of question similarity and learning interactions, we leverage the global-aware question embeddings $\{\hat{q}_1, \ldots, \hat{q}_i\}$ and interaction embeddings $\{x_1, \ldots, x_i\}$ for a more fine-grained analysis of learning behaviors.
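The Attempt Times prior described above amounts to a running counter per concept. The sketch below is one hedged reading of it: the function name `attempt_times` is hypothetical, the input is assumed to be the sequence of concept indices underlying a learner's questions, and the count is assumed to include the current attempt.

```python
from collections import defaultdict


def attempt_times(concept_sequence):
    """Return, for each time step, how many times the learner has attempted
    the concept underlying the current question (current attempt included)."""
    seen = defaultdict(int)
    counts = []
    for concept in concept_sequence:
        seen[concept] += 1
        counts.append(seen[concept])
    return counts


print(attempt_times([3, 1, 3, 3, 2]))  # [1, 1, 2, 3, 1]
```

In a full model these integer counts would typically be embedded or normalized before being concatenated with the other priors; that step is left out here because the text does not specify it.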
Although learning performance can be assessed using questions and interactions, we still face the inherent KT challenge of forgetting behaviors modeling. To meet the requirements of both dependency learning and forgetting modeling, we adopt a forgetting-fusion transformer with a unique implementation. Unlike traditional practice, where query, keys, and values correspond to the same item, we tune the forgetting-fusion transformer to better satisfy our needs by setting question embeddings as query and keys and interaction embeddings as values. The embedding representation of the learner's long-range performance at time step $i$ is calculated by:

$$\mathrm{LP}_i = f_{\mathrm{ForgetAtt}}\big(\hat{q}_i,\ \{\hat{q}_1, \ldots, \hat{q}_i\},\ \{x_1, \ldots, x_i\};\ \theta_2\big)$$

where the three arguments serve as query, keys, and values, respectively, and $\theta_2$ is a global scalar specifically trained for the learning performance encoder. Thus, we obtain the embedding matrix of long-range performance $\mathrm{LP} \in \mathbb{R}^{t \times 2D}$. Then, we concatenate the embedding matrices of AT, LP, and LI. Here, we use GLU [40] to handle the concatenation, which mitigates gradient dispersion and provides nonlinear activation. The final outcome of a learner's personalized prior $E_K \in \mathbb{R}^{t \times 2D}$ is denoted as:

$$E_K = \mathrm{GLU}(\mathrm{AT} \oplus \mathrm{LP} \oplus \mathrm{LI})$$

Although the forgetting-fusion transformer can extract global relationship dependencies of a long learning sequence, it does not perform well in capturing the more fine-grained dynamics of the knowledge-state evolution. To compensate for this defect, we employ a one-dimensional convolutional neural network (1D-CNN) [24] on the learning sequence for a high-level learning behaviors analysis. The sliding window is the key element of the 1D-CNN for feature mapping, where learning interactions are segmented at a fixed length. The critical local features are refined from the continuous time series so that discriminative features can be learned from the prior concatenation $E_K$. Then, the output of the 1D-CNN is fed into GLU for a nonlinear transformation. To accelerate the training process, residual connections [41] are added from the input to the output of each convolutional block. Finally, we build the hierarchical convolutional neural networks by stacking $N$ identical convolution blocks as described above. The convolutional operation of the $\ell$-th CNN layer can be simply expressed as:

$$h^{(\ell)} = \mathrm{GLU}\big(\mathrm{Conv1D}(h^{(\ell-1)})\big) + h^{(\ell-1)}, \qquad h^{(0)} = E_K$$

After local feature extraction by the 1D-CNN, the knowledge state at time step $t$ is updated as $\hat{h}_t$, which stands for the learner's knowledge proficiency level and can be further used to predict their future performance.
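The convolution block described above (a 1D convolution over the sequence, a GLU gate, and a residual connection) can be sketched without any external library. The following is an illustrative sketch under assumptions not stated in the text: the convolution is left-padded with zeros so that step $t$ only sees steps up to $t$, the convolution outputs twice the input channels so that GLU returns the input width, and the names `causal_conv1d`, `glu`, and `conv_block` are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def causal_conv1d(seq, kernel, bias):
    """seq: T x C_in; kernel: K x C_in x C_out; bias: C_out.
    Left zero-padding (assumed) keeps the output at step t independent of later steps."""
    K, c_in, c_out = len(kernel), len(seq[0]), len(bias)
    padded = [[0.0] * c_in for _ in range(K - 1)] + [list(row) for row in seq]
    out = []
    for t in range(len(seq)):
        window = padded[t:t + K]  # covers original steps t-K+1 .. t
        out.append([
            bias[o] + sum(window[k][c] * kernel[k][c][o]
                          for k in range(K) for c in range(c_in))
            for o in range(c_out)
        ])
    return out


def glu(row):
    """Gated linear unit: first half of the channels gated by sigmoid of the second half."""
    half = len(row) // 2
    return [a * sigmoid(g) for a, g in zip(row[:half], row[half:])]


def conv_block(seq, kernel, bias):
    """Conv1D -> GLU -> residual connection; output has the same shape as seq."""
    gated = [glu(row) for row in causal_conv1d(seq, kernel, bias)]
    return [[g + x for g, x in zip(g_row, x_row)] for g_row, x_row in zip(gated, seq)]


T_steps, C, K = 3, 2, 2
seq = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
zero_kernel = [[[0.0] * (2 * C) for _ in range(C)] for _ in range(K)]
print(conv_block(seq, zero_kernel, [0.0] * (2 * C)))  # [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
```

With an all-zero kernel the gated branch vanishes and the block reduces to the identity, which illustrates why the residual path makes stacking several blocks easier to train. Stacking is then just repeated application of `conv_block` with a separate kernel per layer.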

Prediction
The last module of DCKT predicts the learners' learning performance on the next question. When given the question $q_{t+1}$ to solve, a learner searches for the relevant knowledge concept within an established cognitive horizon, which is modeled as the current knowledge state $\hat{h}_t$ after the $t$-th learning interaction. Therefore, we first compute the dot product of $\hat{h}_t$ and the next question embedding $q_{t+1}$, then apply a sigmoid function $\sigma(\cdot)$ to generate the future performance representation:

$$\hat{r}_{t+1} = \sigma\big(\hat{h}_t \cdot q_{t+1}\big)$$

The output $\hat{r}_{t+1} \in [0, 1]$ represents the predicted probability of the learner correctly answering question $q_{t+1}$.
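The prediction step is small enough to write out directly. The sketch below assumes that the knowledge state and the question embedding are plain lists of equal length; the function name `predict_correct_probability` is hypothetical.

```python
import math


def predict_correct_probability(h_t, q_next):
    """Sigmoid of the dot product between the knowledge state and the next question embedding."""
    score = sum(h * q for h, q in zip(h_t, q_next))
    return 1.0 / (1.0 + math.exp(-score))


print(predict_correct_probability([0.0, 0.0], [1.0, 1.0]))  # 0.5
```

A knowledge state that is orthogonal to the question embedding yields a probability of 0.5, i.e., the model expresses no preference; stronger alignment pushes the prediction toward 1, and opposition pushes it toward 0.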

Forgetting-Fusion Transformer
Knowledge tracing is essentially a time series task whose datasets naturally arise from real-world educational applications and are recorded over a fixed sampling interval. The Transformer [22] has a powerful sequential processing capability by virtue of its core component, positional encoding (P.E.), which incorporates positional information into an input sequence so that the modified input can be processed in parallel. However, P.E. vectors only record the location of items in the input sequence, and embedding representations only encode contextual information about the items; simply adding the two cannot simulate the complex patterns of human forgetting behaviors, which places higher demands on the modeling of forgetting during the learning process. The key to solving this issue is a precise quantification of learners' forgetting effect in line with cognitive science findings such as the forgetting curve.
As illustrated in Figure 3, we modify the original Transformer by fusing the forgetting behaviors with the scaled dot-product attention. Similar to the commonly used attention function in the Transformer, we compute the dot products of the query with all keys, scale the dot products by $1/\sqrt{d_k}$, and finally obtain the weights of the values via a softmax function. The biggest difference is that, in place of the positional encoding, we design a forgetting module to depict the overall forgetting effect of the learning process and adapt it to the attention weights. The output matrix is obtained from the following:

where effect stands for the forgetting module output, $d_k$ is the dimension of the keys, and $*$ and $T$ represent the multiply and transpose operations, respectively. According to the relevant cognitive science study in [42], we design exponential decay attention to measure the forgetting effect. To weigh the importance of questions in combination with the inevitable forgetting, we consider two critical elements: the context-aware distance $d(t, \tau)$ [4] and the question difficulty parameter $\theta$. Specifically, each question's attention weight is calculated from its global importance to the entire learning sequence and the time intervals between past questions. The trainable global parameter $\theta$, which indicates a global question difficulty, controls the exponential decay rate throughout the model training process. The calculation of the forgetting effect is simply expressed as:
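One way such exponential-decay attention could be computed is sketched below. This is an illustration under stated assumptions, not the authors' implementation: the forgetting effect is assumed to take the form $\exp(-\theta \cdot d)$ with $\theta > 0$, it is assumed to multiply the scaled dot-product scores before the softmax, and the distance passed in is a plain non-negative number standing in for the context-aware distance $d(t, \tau)$. All function names are hypothetical.

```python
import math


def forgetting_effect(distance, theta):
    """Assumed exponential decay: exp(-theta * distance), with theta > 0 and distance >= 0."""
    return math.exp(-theta * distance)


def softmax(scores):
    """Softmax with the usual max-subtraction for numerical stability."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def forgetting_attention(query, keys, values, distances, theta):
    """Attention for one query: scaled dot products, decayed by the forgetting effect, then softmax."""
    d_k = len(query)
    scores = []
    for key, dist in zip(keys, distances):
        scaled = sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
        scores.append(scaled * forgetting_effect(dist, theta))
    weights = softmax(scores)
    dim = len(values[0])
    # Weighted sum of the value vectors.
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return output, weights
```

With `theta = 0` the effect is 1 everywhere and the function reduces to ordinary scaled dot-product attention; with `theta > 0`, an equally relevant but more distant past question receives a smaller weight.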

Objective Function
All the parameters are learned in the training process by minimizing the cross-entropy log loss between the predicted response $\hat{r}_t$ and the ground-truth response label $r_t$. We use the following objective function to optimize our model:
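For reference, the binary cross-entropy between a predicted probability $\hat{r}_t$ and a label $r_t$ is $-\left(r_t \log \hat{r}_t + (1 - r_t) \log (1 - \hat{r}_t)\right)$. The sketch below averages this quantity over all interactions; the averaging, the clipping constant `eps`, and the function name are assumptions made for illustration rather than details given in the paper.

```python
import math


def binary_cross_entropy(predictions, labels, eps=1e-7):
    """Mean binary cross-entropy between predicted probabilities and 0/1 labels."""
    if len(predictions) != len(labels) or not predictions:
        raise ValueError("predictions and labels must be non-empty and of equal length")
    total = 0.0
    for p, r in zip(predictions, labels):
        # Clip to avoid log(0) for saturated predictions.
        p = min(max(p, eps), 1.0 - eps)
        total -= r * math.log(p) + (1 - r) * math.log(1.0 - p)
    return total / len(predictions)
```

A prediction of 0.5 for every interaction yields a loss of $\log 2 \approx 0.693$, which is a useful baseline when reading training curves.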

Experiments
In this section, we present the experiment settings and evaluate our proposed DCKT model by answering the following research questions:
• RQ1: Can our proposed DCKT model outperform other state-of-the-art KT models?
• RQ2: How do different components in DCKT affect the final performance prediction?
• RQ3: Does the pretext task for knowledge modeling in DCKT help to learn meaningful representations of questions?
• RQ4: How precisely does DCKT track the knowledge state compared with other KT models for personalized learner modeling?

Datasets
We use four real-world public datasets to evaluate the effectiveness of DCKT. Table 3 summarizes the general statistics for each dataset. Details of all datasets are as follows:
• ASSISTment Challenge (https://sites.google.com/view/assistmentsdatamining/dataset (accessed on 20 November 2022)) (ASSISTChall) was publicly released from the 2017 ASSISTments data mining competition and has the most informative descriptions of all the ASSISTments datasets. In addition, it contains the most interactions, with 942,816 learning records, ranking first in terms of the ratio of records per learner.

Baseline Methods
To answer research question 1, we compare the performance of DCKT against several well-known KT methods. To ensure the fairness of the comparison, we adopt the best parameter configurations for all methods. A summary of the baseline methods is as follows:
• DKT [6] introduces deep learning techniques into knowledge tracing for the first time. It utilizes an RNN or LSTM to model the knowledge state as a high-dimensional hidden state in the learning process.
• DKVMN [17] uses a memory network to enrich the hidden variable representation of DKT. Such a memory structure consists of two matrices: a static matrix called key to store all the concepts and a dynamic matrix called value to store and retrieve the mastery level of each concept through reading and writing operations.
• SAKT [23] is the first attentive knowledge tracing model based on the Transformer architecture. The attention mechanism is used for weighing the importance of past questions relative to the entire learning sequence, thereby predicting learning performance on the current question.
• CKT [24] utilizes a CNN to model learners' individualization for KT. It measures a learner's personalization in terms of the learner's personalized prior knowledge and learning rates during their learning process.
• AKT [4] uses a context-aware attention mechanism to learn the context-aware representations of exercises and answers. Unlike the scaled dot-product attention used in SAKT, AKT devises a modified monotonic attention version to simulate the forgetting effect by exponentially decaying attention weights.

Ablation Study of DCKT
To answer research question 2, we designed an ablation study with different variants of our proposed model to evaluate the impact of each component on the final prediction results. These variants are as follows:
• DCKT-NoPreq: This variant randomly initializes the concept embeddings to replace the knowledge-centric unsupervised representation learning module in DCKT, which learns concept representations by extracting the latent prerequisite relations. This variant aims to examine the effectiveness of concept representation learning combined with prerequisite discovery.
• DCKT-NoPrior: This variant removes all the components concerned with prior knowledge. We simply use the interactions to compute the learner's knowledge state. This variant evaluates the impact of the learner's personalized prior on the final results of DCKT.
• DCKT-NoTrans: This variant keeps the basic design of DCKT but replaces all operations of the forgetting-fusion transformer, including those on questions and long-range learning performance, with regular dot-product attention. This variant evaluates the impact of our forgetting-fusion transformer on the performance of DCKT.
• DCKT-NoForget: This variant is built by removing the forgetting module in the forgetting-fusion transformer. Compared with DCKT-NoTrans, this variant can further evaluate the impact of the forgetting module on the performance of the forgetting-fusion transformer.

Dataset Preprocessing
We first preprocess the learning records at each time step for all datasets. For computational efficiency, each dataset has a maximum input sequence length proportional to its average sequence length. If sequences are longer than the fixed length, we split them into several subsequences, while shorter ones are padded up to the fixed length.
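The split-and-pad step can be sketched as follows. The function name and the padding value are illustrative assumptions; the paper does not state which padding token is used, and in practice the padding value must not collide with a real question or response ID.

```python
def split_and_pad(sequence, max_len, pad_value=0):
    """Split a learner's interaction sequence into chunks of max_len, padding the last chunk."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    items = list(sequence)
    chunks = [items[i:i + max_len] for i in range(0, len(items), max_len)]
    # Pad every chunk (only the last one can be short) up to the fixed length.
    return [chunk + [pad_value] * (max_len - len(chunk)) for chunk in chunks]
```

For example, a sequence of five interactions with `max_len=2` becomes three subsequences, the last of which holds one real interaction and one padding entry.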

Training Settings
To ensure the reliability of the experiment results, we perform standard 5-fold cross-validation over all the datasets. For each fold, we split 80% of learners into the training set and validation set, and the remaining 20% as the testing set. For empirical evaluation, we tune the hyper-parameters on the training set, choose the best-performing model on the validation set, and evaluate it on the testing set.
In our training settings, all learnable parameters are randomly initialized using the Xavier initialization [44] and optimized using the Adam gradient descent algorithm [45]. As for important hyper-parameter settings, a dropout rate with a keep probability of 0.2 is set to prevent overfitting, and the number of epochs is 80 for all datasets. For the ASSIST2009, ASSIST2012, ASSIST2015, and ASSISTChall datasets, the batch sizes are set to 10, 20, 25, and 15, respectively. A series of experiments was conducted to determine the hyper-parameters of the forgetting-fusion transformer, including the number of attention heads $h = 8$ and the output dimension $d_{model} = 512$. Thus, the dimensions of queries, keys, and values are $d_q = d_k = d_v = d_{model}/h = 64$, and the inner-layer dimension of the position-wise feed-forward networks is $d_{ff} = 2048$. Our code is implemented with TensorFlow 1.x in Python on a Linux server with NVIDIA GeForce RTX 2080Ti GPUs.
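A learner-level 5-fold split of this kind could be sketched as below, so that all records of one learner fall into the same partition. The function name, the shuffling seed, and the interleaved fold assignment are assumptions; the paper also does not state how the 80% is divided between training and validation, so the sketch returns them as one combined list.

```python
import random


def learner_folds(learner_ids, n_folds=5, seed=0):
    """Return n_folds (train_val_ids, test_ids) pairs, splitting by learner rather than by record."""
    ids = sorted(set(learner_ids))
    random.Random(seed).shuffle(ids)
    # Interleaved assignment gives folds whose sizes differ by at most one.
    folds = [ids[i::n_folds] for i in range(n_folds)]
    splits = []
    for i in range(n_folds):
        test_ids = folds[i]
        train_val_ids = [x for j, fold in enumerate(folds) if j != i for x in fold]
        splits.append((train_val_ids, test_ids))
    return splits
```

Each learner appears in exactly one test fold, so averaging a metric over the five test folds uses every learner once.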

Results and Discussion
In this section, we present the experiment results and discuss the important findings from our experiments.

Learning Performance Prediction (RQ1)
Learning performance prediction assesses a learner's future performance on specific questions, where the predicted binary-valued responses indicate whether the learner has mastered these questions. Thus, we consider it a binary classification task. To evaluate the performance predictions in the KT task, we use the area under the curve (AUC) as the evaluation metric and compare DCKT with several state-of-the-art KT methods using the average AUC results across five test folds. Table 4 reports the AUC results of all methods over the four public datasets, and Figure 4 visualizes the average AUC values with bar plots. The experiment results indicate that DCKT outperforms all other baselines over the four datasets. In comparison with the state-of-the-art methods, DCKT gains average AUC improvements of 1.1%, 7.1%, and 6.1% on the ASSIST2012, ASSIST2015, and ASSISTChall datasets, respectively. In addition, we noticed some interesting findings. First, DCKT achieves significant performance gains on ASSIST2015 and ASSISTChall, which reflects its strong ability to extract meaningful information from long sequences. It also performs well on purely question-labeled KT datasets, without relying on question-concept relations. Second, DCKT achieves only slight improvements on the ASSIST2009 and ASSIST2012 datasets, which can be attributed to the complexity of the latent knowledge structure in these datasets: both contain a larger number of questions, which poses a great challenge to knowledge tracing.
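As a reminder of what the metric measures, AUC equals the probability that a randomly chosen correct response receives a higher predicted score than a randomly chosen incorrect one, with ties counted as one half. The sketch below computes it directly from that pairwise definition; the function name is hypothetical, the paper does not say which implementation was used, and the quadratic loop is meant for illustration rather than for large datasets.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which is why AUC values are compared across models fold by fold and then averaged.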

Ablation Study (RQ2)
Table 5 summarizes the average AUC results for DCKT and all its variants, each of which removes or replaces one essential component of the complete model. From Table 5, we can draw some important conclusions. First, the impact of concept prerequisite inference can be observed by comparing DCKT and DCKT-NoPreq, which demonstrates DCKT's ability to model knowledge and learn valuable representations. Second, for the ASSIST2009, ASSIST2012, ASSIST2015, and ASSISTChall datasets, DCKT achieves statistically significant improvements over DCKT-NoPrior, by margins of 6.2%, 4.9%, 15.1%, and 11.1%, respectively. This suggests that the personalized prior refinement module plays a crucial role in knowledge-state modeling. It can learn meaningful tokens from large-scale learning logs, reflecting the knowledge proficiency level unique to each learner. Third, the comparison of DCKT with DCKT-NoTrans further demonstrates the forgetting-fusion transformer's superior performance, which benefits from its powerful global relation learning ability. Finally, the impact of the forgetting mechanism can be observed by comparing DCKT-NoTrans with DCKT-NoForget. The clear performance gap between these two models over all datasets, especially ASSIST2015 and ASSISTChall, demonstrates the necessity of incorporating forgetting behaviors in learner modeling.

In DCKT, the question embeddings are initialized with question identifiers. The feature weight matrix is obtained through unsupervised graph representation learning, so the learned embeddings are supposed to integrate concept prerequisites and question information. To assess the significance of the question embeddings learned by DCKT, we randomly select 200 questions from each of the ASSIST2009 and ASSISTChall datasets and visualize the multidimensional question embeddings using t-SNE [46] in Figure 5.
Following the principle that questions underlying the same concept are labeled with the same color, we made some interesting observations. First, as shown in Figure 5, the clustering results in the two datasets show that DCKT can learn question embedding representations well, where questions with the same concept are mostly distributed in the same cluster. Second, similar concepts with more closely related meanings are clustered at close range in the latent embedding space. For example, there are eight distinct concepts in the visualization results for ASSIST2009 in Figure 5a. The largest cluster, with concept ID 24, which represents "Addition and Subtraction Fractions", is close to the cluster with concept ID 36, which represents the "Unit Rate" operation. This phenomenon is consistent with the inner knowledge structure. However, the purple clusters with IDs {6, 12}, which correspond to the relatively unrelated concepts "Stem and Leaf Plot" and "Circle Graph", are the furthest away from all the other clusters. In summary, the clustering results intuitively describe the complex and implicit relationships between concepts and questions, which can provide important references for knowledge discovery.

Knowledge-State Visualization (RQ4)
To accomplish the goal of personalized learner modeling, we examine the effectiveness of DCKT in tracing the knowledge state in terms of accuracy and plausibility. Figure 6 shows three visualization cases of the traced knowledge-state results from the same learning sequence, a fragment of a learner's interactions taken from the ASSISTChall dataset. From Figure 6, we draw some important findings that can help build a personalized profile of the learner.
The first case demonstrates that our proposed DCKT can achieve more accurate performance prediction results than the CKT model. As shown in Figure 6a, a heatmap visualizes the prediction probabilities of the learner answering questions correctly for the CKT and DCKT models. The horizontal axis refers to a learning sequence taken from the ASSISTChall dataset, where the learner has answered 23 questions on five concepts. Here, every question is denoted as a tuple consisting of its underlying concept and the correctness of the learner's answer. On the one hand, DCKT performs better in extracting personalized prior knowledge from the learner's practice history. We can observe that DCKT achieves significantly better predictions than CKT in the latter part of the learning sequence because a longer learning sequence implies a more abundant prior of the learner. On the other hand, DCKT also performs excellently in simulating the learner's forgetting behaviors during the learning process. For instance, the learner practiced questions corresponding to the same concept $c_{50}$ at time steps 10 and 12-17. Still, all were answered incorrectly, mainly due to the forgetting effect resulting from multiple intervals before reviewing concept $c_{50}$. In contrast, DCKT produces lower probabilities for these questions, as the forgetting-fusion transformer enables it to extract the forgetting features of human memory.

For the second case, to obtain a reasonable explanation of the learner's knowledge state, a radar chart that describes the evolving process of the learner's knowledge proficiency is shown in Figure 6b. From the changing region between the first and last interactions, we notice an overall improvement in the knowledge proficiency levels for all concepts except concept $c_{50}$. To determine the reasons for this learning regression, as shown in Figure 6c, we mine the mutual relationship between the learner's proficiency in concept $c_{50}$ and their answers to related questions. We can see that when the learner correctly answered questions corresponding to concept $c_{50}$, their proficiency in concept $c_{50}$ also increased. However, it is a nonmonotonic relationship affected by many potential factors, such as the practice of related concepts, review, and forgetting behaviors, which all affect the learner's knowledge state differently. While not all visualizations of the learners' knowledge state are precise in an intelligent education scenario, these findings can support personalized learner assessment and targeted instructional improvements.

Conclusions and Future Work
In this paper, we explore the coupling influence of prerequisite relationships between knowledge concepts (i.e., knowledge-centric) and learners' forgetting behaviors (i.e., learner-centric) in promoting the performance of KT tasks. We thus propose a novel KT model named DCKT. Specifically, we leverage an unsupervised representation learning method to construct a prerequisite graph and learn concept embeddings as a pretext task (knowledge-centric). Then, these learned embeddings are employed as input for the downstream task to perform knowledge tracing. As for the common forgetting behaviors, we design a forgetting-fusion transformer to measure the forgetting effect during the learning process (learner-centric). Extensive experimental results over four public datasets show that DCKT outperforms all compared methods in learning performance prediction. Moreover, the visualization results show that DCKT can not only learn valuable embedding representations for knowledge components but also model an accurate and reasonable knowledge state for learners. Our work points out a potential research avenue to advance the KT task by exploiting the complementary effects of knowledge and learner, but effective ways to combine the two need to be further explored.
For future work, we will explore more research opportunities for knowledge discovery and learner personalization modeling. For instance, we may use multimodal datasets or integrate educational contexts to enrich embedding representations for questions and concepts. Furthermore, we intend to pretrain question representations in a self-supervised learning manner that can automatically generate labels. Finally, for modeling knowledge states dynamically, we will investigate how to fully exploit dynamic information in the massive interaction records.

Figure 1. An example of knowledge tracing.

Figure 2. Overview of the DCKT model, where the top box represents unsupervised graph representation learning (knowledge-centric module), and the bottom box represents KT with fine-grained forgetting behaviors modeling (learner-centric module).

Figure 3. Illustration of the forgetting-fusion scaled dot product.

Figure 4. The average AUC values of all KT methods over four datasets.

Figure 5. Clustering results of question embeddings learned by DCKT in two datasets: (a) ASSIST2009; (b) ASSISTChall, where the color of the question nodes refers to the underlying concept to which they belong.

Figure 6. Visualization cases of a learner's knowledge state tracked by DCKT, using a learning sequence taken from the ASSISTChall dataset, where the learner answered 23 questions on five concepts. In (a), a heatmap compares the prediction probabilities of the learner answering questions correctly for the CKT and DCKT models. (b) is a radar chart that gives a before-and-after comparison of the learner's knowledge proficiency in the learning process. (c) depicts the mutual relationship between the learner's evolving proficiency on concept $c_{50}$ and their answers.

Table 1. A comparison of knowledge tracing models.

Table 2. A list of important notations.

Table 3. Dataset statistics.
• ASSISTments2015 (https://sites.google.com/site/assistmentsdata/home/2015-assistments-skill-builder-data (accessed on 20 November 2022)) (ASSIST2015) is composed of 708,631 response records over 100 distinct concepts produced by 19,917 students in 2015. The biggest difference between ASSIST2015 and previous versions of the ASSISTments datasets is that it provides no metadata or concept. Despite the increasing number of records, ASSIST2015 has the lowest average number of records per learner at around 36.
• ASSISTment Challenge (https://sites.google.com/view/assistmentsdatamining/dataset (accessed on 20 November 2022)) (ASSISTChall)

Table 4. The AUC results of all KT methods over four datasets.

Table 5. Ablation study of DCKT and its variants over four datasets.