Pre-Training Multi-Concept Question Embeddings for Knowledge Tracing

Lu, Yaowen; Zhang, Xiankun; Zhang, Huitao

doi:10.3390/app15073654

Open AccessArticle

Pre-Training Multi-Concept Question Embeddings for Knowledge Tracing

by

Yaowen Lu

,

Xiankun Zhang

and

Huitao Zhang

^*

College of Artificial Intelligence, Tianjin University of Science and Technology, Tianjin 300457, China

^*

Author to whom correspondence should be addressed.

Appl. Sci. 2025, 15(7), 3654; https://doi.org/10.3390/app15073654

Submission received: 4 March 2025 / Revised: 22 March 2025 / Accepted: 24 March 2025 / Published: 26 March 2025

(This article belongs to the Section Computing and Artificial Intelligence)

Download

Browse Figures

Versions Notes

Abstract

With the rise of online education platforms, the vast number of accumulated data conceal patterns regarding the evolution of students’ knowledge states. Tracking a student’s knowledge state through their historical interaction records is known as knowledge tracing, a critical task in intelligent educational systems. Current research predominantly focuses on designing high-performance networks to enhance knowledge state tracking capabilities, often employing simplistic methods for question embedding, such as one-hot encoding or graph-based representations. This paper proposes a pre-training model for multi-concept question embedding (Pre-training Multi-concept Question Embedding, PMQE), aimed at providing a robust upstream tool for knowledge tracing and intelligent education fields. We first leverage the textual information in the dataset, representing concepts using their names or descriptive text embeddings. We then utilize the structural information from the question–concept graph, applying graph convolutional networks to derive question embeddings that integrate both textual semantics and structural information. Additionally, auxiliary information (such as question difficulty) is utilized for joint model optimization. The model outputs include question embeddings, question–concept weight matrices, and question difficulty, which can be used for knowledge tracing and other downstream tasks. Experiments conducted on two real-world datasets and multiple models validate the effectiveness of our approach.

Keywords:

knowledge tracing; pre-training; large language models; graph neural networks; intelligent educational systems

1. Introduction

In intelligent educational systems, knowledge tracing is a crucial task aimed at modeling students’ proficiency levels based on their historical answering records and assessing their current knowledge state [1,2]. The task is typically evaluated by the ability to predict whether a student will answer the next question correctly. To address this problem, numerous methods have been developed. A key element in the definition of the knowledge tracing task is the embedding of the questions themselves during student interactions. Most existing methods do not focus on the embedding of questions but instead jointly train the representation of questions within the framework of a predictive model using supervised methods. However, due to the potential insufficiency of interaction data for some questions or concepts, the issue of data sparsity arises, limiting the expressiveness of embeddings obtained through supervised methods. As illustrated in Figure 1, knowledge tracing involves analyzing students’ mastery of different concepts at each time step based on recorded interactions with a sequence of questions, which include information about the questions, the concepts covered, and the correctness of the responses. Typically, answering correctly indicates an improvement in the student’s mastery of the corresponding concepts, as shown with

q_{4}

, where the student’s proficiency in concept

c_{2}

increases. Conversely, an incorrect response, such as with

q_{3}

, suggests a decline in proficiency for concepts

c_{1}

and

c_{3}

. The objective of this study is to construct question embeddings from a multi-concept perspective, allowing for a more accurate reflection of the diverse concepts covered by a question.

Current pre-training methods for question embeddings do not consider the textual information of concepts or questions, nor do they account for the possibility of a question encompassing multiple concepts.

Based on this analysis and consideration, we propose a pre-training model that integrates textual embeddings and structural information to uniformly model the relationships between questions and concepts, providing high-quality embeddings for knowledge tracing and its downstream tasks. Specifically, we initially employ large language models to obtain textual embeddings of concepts. Then, we construct a question–concept graph from the interaction history and use graph convolutional methods to update the embeddings of questions and concepts, subsequently reconstructing the graph and optimizing embeddings through graph loss calculation. The experimental results on two real-world datasets demonstrate that compared to traditional supervised question embedding methods in knowledge tracing, our pre-trained embeddings can enhance model performance and improve interpretability.

Our contributions are as follows:

We developed a method to obtain both question embeddings and question–concept weight matrices, facilitating downstream tasks such as knowledge tracing and question recommendation. This approach transitions from traditional single-concept to multi concept modeling.
By leveraging the relatively fixed nature of concept texts, we utilized methods from other text embedding models to represent concepts, enriching the structured information of the question–knowledge bipartite graph with additional textual semantic information.
We designed a difficulty analyzer that assesses question difficulty based on its embeddings and auxiliary information, which can be utilized for question evaluation and incorporated into downstream tasks [3,4,5].

2. Related Work

The related work is categorized into the following aspects: the knowledge tracing task itself, representations of question embeddings, and other methods utilized in this paper.

2.1. Knowledge Tracing

The digitalization of education has generated a vast amount of learning interaction records [1,2,5]. Knowledge tracing aims to estimate students’ current knowledge states from these records, making it a fundamental task in the realm of intelligent education. Below, we briefly review related work on knowledge tracing, with a focus on developments before and after the introduction of deep learning.

2.1.1. Knowledge Tracing Before Deep Learning

The task of knowledge tracing was originally proposed in the last century. Bayesian Knowledge Tracing (BKT) [6] assumes that a student’s knowledge state is a set of binary variables, using a Hidden Markov Model to update the probability of mastering concepts based on learner performance. During the subsequent period, factor analysis models were primarily used to accomplish the knowledge tracing task. This theory played a significant role in educational assessment, with models such as Learning Factors Analysis (LFA) [7] and Knowledge Tracing Machines (KTM) [8] representing logical knowledge tracing methods. These models excelled in both interpretability and performance compared to Bayesian Knowledge Tracing.

2.1.2. Deep Learning-Based Knowledge Tracing

In 2015, Deep Knowledge Tracing (DKT) [9] was the first to successfully apply LSTM based [10] deep neural networks to the knowledge tracing task, achieving commendable results. The Dynamic Key-Value Memory Network (DKVMN) [11] utilized dynamic key value matrices to store relationships between concepts and questions, as well as students’ knowledge states. Attentive Knowledge Tracing (AKT) [12] employed a monotonic attention mechanism to simulate student forgetting and enhanced model interpretability using the Rasch model. Graph-based Knowledge Tracing (GKT) [13] introduced graph neural networks to the knowledge tracing task for the first time and proposed various graph construction methods. These efforts have fully integrated knowledge tracing into the era of deep learning.

2.2. Question Embedding Representation

Question embedding is often a crucial component of the knowledge tracing task. Most knowledge tracing models treat it as an internal task, synchronously constructing question representations during the training process or using simple methods to differentiate questions. Some research, however, focuses on modeling question embeddings, even treating it as a standalone task and training a separate model [14].

2.2.1. Traditional Question Embedding

In traditional deep learning approaches to knowledge tracing, such as those represented by DKT, questions are typically embedded using simple one-hot encoding. To mitigate the issue of data sparsity, some studies simplify datasets from the question level to the concept level, using concepts to represent entire questions. This approach alleviates data sparsity to some extent but overlooks the nuances between different questions. Other works incorporate question embeddings into the joint learning process within knowledge tracing models. For instance, AKT [12] introduces the Rasch model, which captures variations in questions under the same concept. Some studies attempt to construct question embeddings using textual information, such as EKT [15] and EERNN [16], which employ a private dataset to embed entire question texts using word2vec [17]. Additional research extracts information from knowledge structure graphs, as seen in SKT [18], JKT [19], and TSKT [20].

2.2.2. Pre-Trained Question Embedding

Recent studies have isolated question embedding as a standalone task, independent from the knowledge tracing task [21,22]. This unified approach to constructing question embeddings, which is not task-specific, can be utilized across multiple knowledge tracing models and other downstream tasks like question recommendation and learning path planning. This allows downstream tasks to focus on their core objectives.

PEBG [23] represents the question–concept relationship as a bipartite graph, using a simple dot product layer to learn low-dimensional embeddings for questions, and demonstrates its effectiveness in question representation through visualization techniques. PERM [24] employs a two-layer attention mechanism to further uncover semantic information in student–question and student–concept interactions from the global student–question–concept relationships.SEEP [25] constructs a heterogeneous graph of student–question–concept from the dataset, decomposing it into two semantic perspectives and using node-level attention mechanisms to obtain embeddings for questions and concepts.BiCo [26] chooses two similar yet distinct perspectives (subjective and objective) as semantic sources for question embedding, designing a pre-training model for question embeddings that achieves state-of-the-art performance on two real-world datasets, proving its robust expressive capabilities.

Overall, PEBG pioneered the use of pre-training for constructing question embeddings, establishing a fundamental structure for obtaining embeddings from question–concept bipartite graphs and difficulty constraints, thereby laying the groundwork for subsequent studies. However, none of these four models account for the possibility that a single question may involve multiple concepts, nor do they utilize the semantic information readily available from the textual embeddings of concepts. Moreover, by explicitly modeling the multi-concept matrix, we can better serve downstream applications.

2.3. Graph Convolutional Networks

Graph Convolutional Networks (GCNs) [27] are neural network architectures designed to process graph-structured data, and they have been widely applied across various domains, including social network analysis, recommendation systems, and bioinformatics. GCNs perform convolution operations on graph structures to learn node representations, effectively capturing dependencies between nodes. LightGCN [28] aims to simplify the design of traditional GCNs by removing unnecessary components, such as non-linear activation functions and complex aggregation functions. Instead, it updates the central node’s features by directly aggregating the features of neighboring nodes. These characteristics enable us to efficiently and cost-effectively extract knowledge structure information.

3. Problem Formulation

For a student

s \in S

, we define their historical learning interaction sequence as

s = {e_{j}}_{j = 1}^{n}

, where

e_{j}

represents the j-th interaction, and S is a set of all students’ sequences. Each interaction is a tuple

e = 〈 q, r 〉

, where q denotes the question involved in this interaction, and

r \in {0, 1}

is a binary value indicating whether the student’s response was correct. Each question q includes a set of concepts and auxiliary information, represented as

q = 〈 {c}, a 〉

. A question may encompass one or more concepts. The set of all concepts is denoted as

C = {c}

, and the set of all questions is denoted as

Q = {q}

.

The auxiliary information for a question is derived from statistical data within the dataset. Different datasets provide varying types of auxiliary information. For instance, in the Assist09 dataset, we can derive statistics such as the question error rate d, average number of hints

h_{a v g}

, maximum number of hints

h_{m a x}

, average first response time

t_{a v g f i r s t}

, average response time

t_{a v g r e s p}

, and average number of attempts

t_{a t t e m p t}

.

The set of concepts associated with a question indicates which concepts are relevant to that question, while the auxiliary information reflects aspects related to the question’s difficulty.

The knowledge tracing task aims to determine a student’s knowledge state given their past interaction sequence s. Due to the nature of real-world data collection, explicit knowledge states cannot be directly used as targets for the task. Therefore, the primary objective of the knowledge tracing task is to predict the probability that the student will answer the next question correctly. Specifically, given the current sequence

s = {e_{1}, e_{2}, \dots, e_{j}}

, the task is to predict the probability of correctness for the next interaction

e_{j + 1}

, denoted as

P_{j + 1}

.

3.1. Definition 1: Question–Concept Bipartite Graph

In most datasets, a significant portion of questions correspond to more than one concept. We construct a graph

G = (Q, C, E)

to analyze the relationships between questions and concepts. Here, Q denotes the set of question nodes, C represents the set of concept nodes, and E signifies the set of edges between question nodes and concept nodes. The values of the edge set E are determined by the co-occurrence relationships between questions and concepts. Specifically,

E = [A_{i j}] \in {0, 1}^{Q \times C}

, with:

A_{i j} = \{\begin{matrix} 1, & if question q_{i} includes concept c_{j} \\ 0, & otherwise \end{matrix}

(1)

3.2. Definition 2: Question Difficulty

Following previous research [29], we consider question difficulty to be a crucial attribute that not only aids the knowledge tracing task but also plays an important role in other tasks within intelligent educational systems. The difficulty of a question

d_{i}

is statistically derived from the dataset as follows:

d_{i} = \frac{\sum_{s \in S} \sum_{e_{j} \in s} I (q_{e_{j}} = q_{i} \land r_{e_{j}} = 0)}{\sum_{s \in S} \sum_{e_{j} \in s} I (q_{e_{j}} = q_{i})}

(2)

where

I (\cdot)

is the indicator function that equals 1 if the condition is satisfied, and 0 otherwise. The exercise and the response corresponding to the interaction record

e_{j}

are represented as

q_{e_{j}}

and

r_{e_{j}}

, respectively.

4. Method

This chapter introduces our proposed PMQE model. As shown in Figure 2, we first obtain text embeddings of concepts using a large language model, which serves as their initial embeddings. These embeddings are then multiplied by a trainable weight matrix to generate the initial embeddings of questions. A concept-question bipartite graph is constructed based on their co-occurrence relationships in the dataset, and LightGCN is applied to capture structural information, producing refined embeddings for both concepts and questions. The similarity between these embeddings is then used to construct a new graph. Graph embedding loss is computed using both the original bipartite graph and the reconstructed graph. After obtaining updated question embeddings, they are processed through a difficulty analysis module, incorporating auxiliary information, to derive difficulty representations. These representations are then compared with statistical difficulty data from the dataset to compute the difficulty loss. The final optimization objective considers both the graph embedding loss and the difficulty contrastive loss.

4.1. Concept Text Embedding

Given that concepts do not directly interact with students, we utilize existing text embedding models for their representation. Recently, large language models (LLMs) [30] have rapidly advanced, and we employ these models’ text embedding methods to embed concept texts, effectively leveraging the semantic information learned from vast corpora. The textual information corresponding to concepts is relatively straightforward, and we denote the resultant text embeddings as

C \in R^{| C | \times d}

, where

| C |

is the number of concepts and d is the embedding dimension. For each concept

c_{i}

, its text embedding is represented as the i-th row of the matrix C.

Considering the different lengths of text embeddings obtained from various models and the special techniques employed by OpenAI, we limit the length of text embeddings from OpenAI to match those from other text embedding models. For the OpenAI text-embedding-3-large model, we followed OpenAI’s recommended methods for shortening text embeddings, extracting different lengths as needed to evaluate the impact of varying embedding lengths.

We start by computing the L2 norm (Euclidean norm)

{∥ x ∥}_{2} = \sqrt{\sum_{i = 1}^{n} x_{i}^{2}}

of a text embedding vector

x = (x_{1}, x_{2}, \dots, x_{n})

, where

x_{i}

represents the i-th component of the embedding vector and n is the number of target dimensions. Then, we normalize the vector

x

:

x_{normalized} = \{\begin{matrix} x, & {if ∥ x ∥}_{2} = 0 \\ \frac{x}{{∥ x ∥}_{2}}, & otherwise \end{matrix}

(3)

4.2. Question Embedding

To achieve the embedding representation of questions, we multiply the concept embeddings by a weight matrix

W_{C 2 Q} \in R^{| C | \times | Q |}

to obtain the matrix

Q \in R^{| Q | \times d}

. This matrix explicitly represents the relational weights between questions and concepts, thereby fully reflecting the influence of concepts in the question embeddings. Here,

| C |

denotes the number of concepts, and

| Q |

denotes the number of questions. For each question

q_{i}

, its text embedding is represented as the i-th row of the matrix Q.

4.3. Embedding Update Based on LightGCN

Building on this explicit modeling, we further optimize the embeddings by employing the message-passing mechanism of LightGCN to obtain updated embeddings

C^{'}

and

Q^{'}

from the question–concept bipartite graph G. Through message passing on the bipartite graph, our question embeddings capture more structural information on top of the semantic information from text embeddings. This shifts the semantic space of the embeddings from pure text to the knowledge tracing domain, enhancing their expressive power in this context.

4.4. Graph Constraint

After generating the new embeddings, we use these embeddings to reconstruct the bipartite graph

\hat{G}

and compare it with the original question–concept bipartite graph G. We calculate the loss

L_{G}

to optimize the embeddings. The reconstruction of the bipartite graph is based on the similarity between concepts and questions, constructing an edge between two nodes if their similarity exceeds a certain threshold. In the reconstructed bipartite graph

\hat{G}

, the relationship is defined as:

{\hat{r}}_{i j} = σ (q_{i}^{T} c_{j}), i \in [1 \dots |Q|], j \in [1 \dots |C|]

(4)

where

σ

is the sigmoid function to ensure the output is between 0 and 1, and

\hat{q}

and

\hat{c}

represent the updated embeddings of questions and knowledge points.

We use the cross-entropy loss function to optimize the reconstructed bipartite graph against the original question–concept graph built from the dataset:

L_{G} (Q, C) = \sum_{i = 1}^{| Q |} \sum_{j = 1}^{| C |} - (r_{i j} log {\hat{r}}_{i j} + (1 - r_{i j}) log (1 - {\hat{r}}_{i j}))

(5)

This loss function encourages the reconstructed graph to closely resemble the original graph by penalizing discrepancies between them.

4.5. Auxiliary Information Constraint

In addition, the model integrates the question embeddings

Q^{'}

obtained from graph convolution with auxiliary information to estimate the question difficulty

\hat{D}

and calculates the loss against the difficulty D derived from dataset statistics. By combining the reconstruction graph loss

L_{G}

and the difficulty loss

L_{D}

, the model optimizes the embeddings. This approach not only refines the question embeddings but also enhances their accuracy based on difficulty analysis.

We treat each question individually, collecting continuous variables from the dataset as inputs for the question difficulty analysis module. For the Assistment-2009 dataset, we use the following data as inputs: total number of hints, average number of hints, first response time, average response time, and number of attempts. We compute the average for each variable across the dataset and normalize these averages across all questions, feeding these normalized values into a multi-layer perceptron (MLP) to obtain

D_{A}

. Simultaneously, the question embeddings are fed into another MLP to obtain

D_{Q}

, and both results are combined through a fully connected (FC) layer to produce the final difficulty representation

\hat{D}

:

\hat{D} = FC ([{MLP}_{A} (\frac{a - μ}{σ}), {MLP}_{Q} (q)])

(6)

where

a

is the vector of auxiliary information for the questions, consisting of continuous variables.

μ

and

σ

are the mean and standard deviation of the auxiliary information across all questions in the dataset.

{MLP}_{A}

and

{MLP}_{Q}

are two different multi-layer perceptrons. FC is a fully connected layer.

\hat{D}

is the final difficulty representation.

For optimizing difficulty with auxiliary information constraints, we use the error rate, statistically derived from the dataset, as the label:

L_{D} (Q) = \sum_{i = 1}^{| Q |} - (D_{i} log {\hat{D}}_{i} + (1 - D_{i}) log (1 - {\hat{D}}_{i}))

(7)

This loss function helps align the model’s difficulty prediction with observed error rates, thereby refining the embeddings through both structural and difficulty constraints.

4.6. Joint Optimization

Our model performs joint optimization by combining both graph constraints and auxiliary information constraints:

L = λ L_{G} + (1 - λ) L_{D}

(8)

where

λ \in (0, 1)

is a tunable hyperparameter that controls the weighting between the graph constraint and the auxiliary information constraint. This balanced approach allows the model to effectively leverage both structural and difficulty information to optimize the question embeddings.

5. Experiments

5.1. Datasets

In selecting datasets, we considered that some datasets do not have recorded textual fields for concepts. We chose the following two real-world datasets for our experiments:

Assistment-2009: This dataset originates from the Assistment platform and includes 346,860 recorded interactions from 4217 students, covering 110 different concepts. It is a fundamental dataset widely used in knowledge tracing tasks.

Junyi: This is a learning platform constructed based on open-source code released by Khan Academy.

We performed uniform preprocessing on the datasets. Considering the richness of real-world data, we used the following methods to ensure low data sparsity: interactions without knowledge point information were removed, student data with too few learning records (fewer than 5) were excluded, and interactions involving knowledge points with too few corresponding questions (fewer than 10) were also removed. Some users with too few interactions were removed, which may result in the insufficient sampling of short term or intermittent learners, potentially reducing the predictive performance for this user group. Our model aims to construct more expressive exercise embeddings. However, in achieving the goal of constructing more expressive exercise embeddings, this approach can improve data quality, reduce noise, optimize computational resources, and enhance the stability and generalization ability of the model. Table 1 presents the statistics of the datasets after preprocessing.

5.2. Compared Models

In this study, we aim to validate the effectiveness of our proposed model. To this end, we integrate the pre-trained model into classic deep knowledge tracing models and carefully evaluate its performance enhancements. Additionally, to further substantiate the conclusions of our study, we compare our approach with several existing pre-training methods. These comparative experiments not only demonstrate the superiority of our model but also highlight its potential advantages in the relevant field.

Knowledge Tracing Models:
(a)
DKT: The first model to apply deep learning to the field of knowledge tracing, utilizing an RNN with sigmoid activation. It simplifies questions by representing them with single concepts to alleviate data sparsity issues.
(b)
DKVMN: This model uses Memory-Augmented Neural Networks to construct the main framework of the knowledge tracing model. Specifically, it employs a static matrix to store knowledge concepts and a dynamic matrix to store and update the mastery level of response concepts.
Pre-Trained Models:
(a)
PEBG: To our knowledge, the earliest method to treat question representation as a standalone task separated from knowledge tracing. It represents the question–concept relationship as a bipartite graph to optimize question embeddings.
(b)
BiCo: Utilizes self-supervised dual graph contrastive learning with objective and subjective semantics to pre-train question embeddings.

5.3. Experimental Setup and Hyperparameters

The experiments in this paper were conducted on an NVIDIA RTX 4090 24G (NVIDIA Corporation, Santa Clara, CA, USA), and the models and related code were implemented using PyTorch 2.1.2. For each dataset, 70% of the data was used for training, and 30% was used for testing. The learning rate was set to 0.001, with LightGCN configured to have three layers, and the number of epochs was set to 1000. The hyperparameters that needed adjustment in PMQE were

λ

and the embedding dimension d. Based on our experiments, we selected

λ = 0.73

and

d = 128

. In the knowledge tracing models, the learning rate was set to 0.006, and the number of epochs was set to 300.

5.4. Performance

We selected several classic models in the knowledge tracing task to validate the effectiveness of our question embeddings, including the classic deep knowledge tracing model DKT and the Dynamic Key-Value Memory Network (DKVMN), replacing their question representation components with the outputs of our pre-trained model. We also compared our method with previous pre-training approaches like PEBG and BiCo. Through these comparisons across different models, we aim to comprehensively assess the performance improvement brought by our designed pre-trained question-embedding module in the knowledge tracing task.

Students’ response sequences are recorded as either correct (1) or incorrect (0). Therefore, we formulate the knowledge tracing task as a binary classification problem. Based on this, we utilize Accuracy (ACC) and the Area Under the Receiver Operating Characteristic Curve (AUC) as our evaluation metrics. ACC measures the proportion of correctly predicted responses among all responses, providing a straightforward assessment of overall model performance. AUC, on the other hand, evaluates the model’s ability to distinguish between correct and incorrect responses by computing the area under the ROC curve, which plots the true positive rate against the false positive rate. A higher AUC value indicates the better discriminative capability of the model.

The experimental results, as shown in Table 2, demonstrate that the models augmented with PMQE exhibit superior performance across all metrics. On the Assist09 dataset, the PMQE+DKVMN model achieved the highest accuracy (ACC) of 77.74% and an AUC value of 78.86%, representing improvements of 5.56% and 4.25% over the baseline DKVMN model, respectively. Similarly, on the Junyi dataset, the PMQE+DKVMN model outperformed other combinations, achieving an accuracy of 73.52% and an AUC of 75.81%, which are 2.92% and 4.01% higher than the DKVMN baseline, respectively. Furthermore, models utilizing PMQE outperformed previous pre-training approaches.

The improvements in various performance metrics indicate that PMQE is an effective method for question representation, significantly enhancing predictive capabilities in knowledge tracing tasks and providing new insights and tools for educational data mining.

5.5. Question–Concept Matrix

We visualized the change in concepts for a particular student (as shown in Figure 3). The figure is divided into three parts.

The top black-and-white matrix represents the coverage of concepts in the questions during interactions. The darker the color, the greater the weight of that concept. Most questions involve a single concept, while a few involve two or three. The middle section uses solid or hollow circles to indicate whether the response was correct during the interaction, with solid circles representing correct answers and hollow circles representing incorrect ones. The bottom color matrix displays the predicted knowledge state of the student for each concept, with each column representing the state after each interaction. Colors closer to yellow indicate better mastery of that concept as predicted by the model.

Our observations are as follows.

For single concept interactions, as seen in the third and fourth columns, both questions pertain to concept

c_{3}

. The third question was answered correctly, while the fourth was answered incorrectly, suggesting that the corresponding knowledge state should decrease and then increase. This matches the changes shown in the knowledge state matrix.

Multiple concept interactions, as observed in the tenth column, involve concepts

c_{5}

and

c_{6}

, and the question was answered correctly. Therefore, we predict an increase in knowledge state for

c_{5}

and

c_{6}

, which is consistent with the changes shown in the figure.

This indicates that our weight matrix can accurately describe the relationship between questions and concepts.

5.6. Ablation Study

To validate the effectiveness of each module in our model, we designed the following ablation experiments:

PMQE: The initial model.
no-text-emd: Removed the text embedding of concepts, using one-hot encoding for concept representation as in most methods.
no-c2q-weight: Removed the question-to-concept weight matrix, decoupling question embeddings from concepts and training with independent weights.
no-side-info: Removed additional auxiliary information used for difficulty analysis, relying solely on question embeddings to compute difficulty, i.e.,

$\hat{D} = {MLP}_{Q} (q)$

(9)
no-difficulty: Removed the entire difficulty analysis module, effectively removing the difficulty constraint, and used only the question–concept bipartite graph to guide model optimization. Without the difficulty analysis module, the model’s loss degenerates to:

$L (Q, C) = L_{G}$

(10)

For the ablation study conducted using the Assist09 dataset with the DKT model, the results are presented in Figure 4. By systematically removing key components of the PMQE model, we observed that each module significantly impacts knowledge tracing performance. Specifically, removing the concept text embedding (no-text-emb) resulted in the greatest performance decline (a decrease of 1.06% in ACC and 0.68% in AUC), underscoring the importance of semantic representations provided by pre-trained language models for modeling concepts. The absence of the question-to-concept weight matrix (no-c2q-weight) led to reductions of 0.71% in ACC and 0.21% in AUC, indicating that explicitly modeling the association weights between questions and concepts effectively captures complex relational structures. When auxiliary information for questions was removed (no-side-info), the performance drop was relatively minor (0.32% in ACC and 0.15% in AUC), suggesting that the fundamental embeddings already capture core features, though auxiliary information still provides supplementary discriminative features for difficulty evaluation. Notably, eliminating the difficulty analysis module (no-difficulty) resulted in ACC and AUC decreases of 0.79% and 0.70%, respectively, confirming the critical role of a dynamic, multi-source difficulty assessment mechanism in accurately estimating student abilities. Overall, the experimental results demonstrate that the model integrates semantic embeddings, structural constraints, and cognitive diagnostic theories to achieve multidimensional optimization in the knowledge tracing task.

6. Conclusions and Future Work

In this paper, we proposed a pre-training method to obtain high-quality question embeddings. Our approach first utilizes large language models to derive text representations of concepts, which are then used to obtain initial question representations through a coefficient matrix. These representations are further refined using graph convolution methods to produce updated embeddings. The model optimizes its weights by reconstructing the graph structure and comparing it against the original question–concept graph. Additionally, the model incorporates auxiliary information such as question difficulty to jointly optimize the model’s weights. Comparative experiments across multiple datasets demonstrate that our method offers superior performance, while ablation studies confirm the effectiveness of each designed module. The model outputs include question embeddings and a weight relationship matrix between questions and concepts, which can be applied to knowledge tracing and its downstream tasks to enhance performance and interpretability.

The prerequisite relationships between concepts were not considered, but we believe these relationships are an important structural aspect rather than relying solely on our question–concept bipartite graph. Secondly, the difficulty analysis module heavily depends on the statistical information of exercises, leading to a cold-start problem for newly introduced exercises. Additionally, due to dataset limitations, we only used the textual information of knowledge point names, which prevents us from fully leveraging the rich semantics of complete exercises.

Future work directions include: (1) considering prerequisite relationships between concepts; and (2) the current model is trained on specific datasets and relies on specific concept and question IDs. We aim to train on multiple datasets to develop a model that works from question text, enabling it to function independently of question or concept IDs for improved generalizability.

Author Contributions

Y.L.: Conceptualization, investigation, methodology, software, and writing; X.Z.: project administration; H.Z.: resources, validation. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Foundation of China (Grant No. 62377036).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

All datasets are available from public websites.

Acknowledgments

We sincerely thank Ke Zhu for his supervisory guidance throughout the research process, particularly in refining the experimental design, and for his active participation in the investigation phase, including the critical analysis of preliminary results. We also appreciate the constructive comments from the anonymous reviewers.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Shen, S.; Liu, Q.; Huang, Z.; Zheng, Y.; Yin, M.; Wang, M.; Chen, E. A Survey of Knowledge Tracing: Models, Variants, and Applications. IEEE Trans. Learn. Technol. 2024, 17, 1858–1879. [Google Scholar] [CrossRef]
Abdelrahman, G.; Wang, Q.; Nunes, B. Knowledge Tracing: A Survey. ACM Comput. Surv. 2023, 55, 1–37. [Google Scholar] [CrossRef]
Wu, Z.; Li, M.; Tang, Y.; Liang, Q. Exercise recommendation based on knowledge concept prediction. Knowl.-Based Syst. 2020, 210, 106481. [Google Scholar] [CrossRef]
Ennouamani, S.; Mahani, Z. An overview of adaptive e-learning systems. In Proceedings of the 2017 Eighth International Conference on Intelligent Computing and Information Systems (ICICIS), Cairo, Egypt, 5–7 December 2017; pp. 342–347. [Google Scholar] [CrossRef]
Cingi, C.C. Computer Aided Education. Procedia—Soc. Behav. Sci. 2013, 103, 220–229. [Google Scholar] [CrossRef][Green Version]
Corbett, A.T.; Anderson, J.R. Knowledge tracing: Modeling the acquisition of procedural knowledge. User Model. User-Adapt. Interact. 1994, 4, 253–278. [Google Scholar] [CrossRef]
Cen, H.; Koedinger, K.; Junker, B. Learning Factors Analysis—A General Method for Cognitive Model Evaluation and Improvement. In Intelligent Tutoring Systems; Ikeda, M., Ashley, K.D., Chan, T.W., Eds.; Springer: Berlin/Heidelberg, Germany, 2006; pp. 164–175. [Google Scholar] [CrossRef]
Vie, J.J.; Kashima, H. Knowledge Tracing Machines: Factorization Machines for Knowledge Tracing. Proc. AAAI Conf. Artif. Intell. 2019, 33, 750–757. [Google Scholar] [CrossRef]
Piech, C.; Bassen, J.; Huang, J.; Ganguli, S.; Sahami, M.; Guibas, L.; Sohl-Dickstein, J. Deep knowledge tracing. In Proceedings of the 28th International Conference on Neural Information Processing Systems, NIPS’14, Montreal, QC, Canada, 8–13 December 2014; NIPS: Cambridge, MA, USA, 2015; Volume 1, pp. 505–513. [Google Scholar]
Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef] [PubMed]
Zhang, J.; Shi, X.; King, I.; Yeung, D.Y. Dynamic Key-Value Memory Networks for Knowledge Tracing. In Proceedings of the WWW’17: 26th International Conference on World Wide Web, Perth, Australia, 3–7 April 2017; International World Wide Web Conferences Steering Committee: Geneva, Switzerland, 2017; pp. 765–774. [Google Scholar] [CrossRef]
Ghosh, A.; Heffernan, N.; Lan, A.S. Context-Aware Attentive Knowledge Tracing. In Proceedings of the KDD’20: 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Virtual Event, 6–10 July 2020; Association for Computing Machinery: New York, NY, USA, 2020; pp. 2330–2339. [Google Scholar] [CrossRef]
Nakagawa, H.; Iwasawa, Y.; Matsuo, Y. Graph-Based Knowledge Tracing: Modeling Student Proficiency Using Graph Neural Network. In Proceedings of the WI ’19: IEEE/WIC/ACM International Conference on Web Intelligence, Thessaloniki, Greece, 14–17 October 2019; Association for Computing Machinery: New York, NY, USA, 2019; pp. 156–163. [Google Scholar] [CrossRef]
Zanellati, A.; Mitri, D.D.; Gabbrielli, M.; Levrini, O. Hybrid Models for Knowledge Tracing: A Systematic Literature Review. IEEE Trans. Learn. Technol. 2024, 17, 1021–1036. [Google Scholar] [CrossRef]
Liu, Q.; Huang, Z.; Yin, Y.; Chen, E.; Xiong, H.; Su, Y.; Hu, G. EKT: Exercise-Aware Knowledge Tracing for Student Performance Prediction. IEEE Trans. Knowl. Data Eng. 2021, 33, 100–115. [Google Scholar] [CrossRef]
Su, Y.; Liu, Q.; Liu, Q.; Huang, Z.; Yin, Y.; Chen, E.; Ding, C.; Wei, S.; Hu, G. Exercise-Enhanced Sequential Modeling for Student Performance Prediction. Proc. AAAI Conf. Artif. Intell. 2018, 32, 2435–2443. [Google Scholar] [CrossRef]
Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G.S.; Dean, J. Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems, Proceedings of the 27th Annual Conference on Neural Information Processing Systems, Lake Tahoe, NY, USA, 5–8 December 2013; Curran Associates, Inc.: Red Hook, NY, USA, 2013; Volume 26. [Google Scholar]
Tong, S.; Liu, Q.; Huang, W.; Hunag, Z.; Chen, E.; Liu, C.; Ma, H.; Wang, S. Structure-Based Knowledge Tracing: An Influence Propagation View. In Proceedings of the 2020 IEEE International Conference on Data Mining (ICDM), Sorrento, Italy, 17–20 November 2020; pp. 541–550, ISSN 2374-8486. [Google Scholar] [CrossRef]
Song, X.; Li, J.; Tang, Y.; Zhao, T.; Chen, Y.; Guan, Z. JKT: A joint graph convolutional network based Deep Knowledge Tracing. Inf. Sci. 2021, 580, 510–523. [Google Scholar] [CrossRef]
Yang, H.; Hu, S.; Geng, J.; Huang, T.; Hu, J.; Zhang, H.; Zhu, Q. Heterogeneous graph-based knowledge tracing with spatiotemporal evolution. Expert Syst. Appl. 2024, 238, 122249. [Google Scholar] [CrossRef]
Yin, Y.; Liu, Q.; Huang, Z.; Chen, E.; Tong, W.; Wang, S.; Su, Y. QuesNet: A Unified Representation for Heterogeneous Test Questions. In Proceedings of the KDD’19: 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Anchorage, AK, USA, 4–8 August 2019; Association for Computing Machinery: New York, NY, USA, 2020; pp. 1328–1336. [Google Scholar] [CrossRef]
Huang, Z.; Lin, X.; Wang, H.; Liu, Q.; Chen, E.; Ma, J.; Su, Y.; Tong, W. DisenQNet: Disentangled Representation Learning for Educational Questions. In Proceedings of the KDD’21: 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, Virtual Event, 14–18 August 2021; Association for Computing Machinery: New York, NY, USA, 2020; pp. 696–704. [Google Scholar] [CrossRef]
Liu, Y.; Yang, Y.; Chen, X.; Shen, J.; Zhang, H.; Yu, Y. Improving knowledge tracing via pre-training question embeddings. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, Yokohama, Japan, 7–15 January 2021. [Google Scholar]
Wang, W.; Ma, H.; Zhao, Y.; Yang, F.; Chang, L. PERM: Pre-training Question Embeddings via Relation Map for Improving Knowledge Tracing. In Database Systems for Advanced Applications; Bhattacharya, A., Lee Mong Li, J., Agrawal, D., Reddy, P.K., Mohania, M., Mondal, A., Goyal, V., Uday Kiran, R., Eds.; Springer: Cham, Switzerland, 2022; pp. 281–288. [Google Scholar]
Wang, W.; Ma, H.; Zhao, Y.; Yang, F.; Chang, L. SEEP: Semantic-enhanced question embeddings pre-training for improving knowledge tracing. Inf. Sci. 2022, 614, 153–169. [Google Scholar] [CrossRef]
Wang, W.; Ma, H.; Zhao, Y.; Li, Z. Pre-training Question Embeddings for Improving Knowledge Tracing with Self-supervised Bi-graph Co-contrastive Learning. ACM Trans. Knowl. Discov. Data 2024, 18, 74. [Google Scholar] [CrossRef]
Zhang, S.; Tong, H.; Xu, J.; Maciejewski, R. Graph convolutional networks: A comprehensive review. Comput. Soc. Netw. 2019, 6, 11. [Google Scholar] [CrossRef] [PubMed]
He, X.; Deng, K.; Wang, X.; Li, Y.; Zhang, Y.; Wang, M. LightGCN: Simplifying and Powering Graph Convolution Network for Recommendation. In Proceedings of the SIGIR’20: 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event, 25–30 July 2020; Association for Computing Machinery: New York, NY, USA, 2020; pp. 639–648. [Google Scholar] [CrossRef]
Huang, Z.; Liu, Q.; Chen, E.; Zhao, H.; Gao, M.; Wei, S.; Su, Y.; Hu, G. Question Difficulty Prediction for READING Problems in Standard Tests. Proc. AAAI Conf. Artif. Intell. 2017, 31, 1352–1359. [Google Scholar] [CrossRef]
Zhao, W.X.; Zhou, K.; Li, J.; Tang, T.; Wang, X.; Hou, Y.; Min, Y.; Zhang, B.; Zhang, J.; Dong, Z.; et al. A Survey of Large Language Models. arXiv 2024, arXiv:2303.18223. [Google Scholar] [CrossRef]

Figure 1. Knowledgetracing mission.

Figure 2. Structure of PMQE.

Figure 3. Knowledge state changes for a student.

Figure 4. Abalation study.

Table 1. Dataset statistics.

Dataset	Assist09	Junyi
Students	2932	10,000
Concepts	105	835
Questions	16,811	835
Interactions	268,271	353,835

Table 2. Performance comparison of PMQE.

	Assist09		Junyi
Model	ACC	AUC	ACC	AUC
DKT	71.42	73.73	69.42	70.3
DKVMN	72.18	74.61	70.6	71.8
PEBG+DKT	74.7	76.46	71.7	73.86
PEBG+DKVMN	75.83	77.87	72.23	74.75
BiCo+DKT	75.41	76.2	71.89	73.44
BiCo+DKVMN	77.19	78.31	73.15	75.17
PMQE+DKT	75.95	77.18	71.67	73.73
PMQE+DKVMN	77.74	78.86	73.52	75.81

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Lu, Y.; Zhang, X.; Zhang, H. Pre-Training Multi-Concept Question Embeddings for Knowledge Tracing. Appl. Sci. 2025, 15, 3654. https://doi.org/10.3390/app15073654

AMA Style

Lu Y, Zhang X, Zhang H. Pre-Training Multi-Concept Question Embeddings for Knowledge Tracing. Applied Sciences. 2025; 15(7):3654. https://doi.org/10.3390/app15073654

Chicago/Turabian Style

Lu, Yaowen, Xiankun Zhang, and Huitao Zhang. 2025. "Pre-Training Multi-Concept Question Embeddings for Knowledge Tracing" Applied Sciences 15, no. 7: 3654. https://doi.org/10.3390/app15073654

APA Style

Lu, Y., Zhang, X., & Zhang, H. (2025). Pre-Training Multi-Concept Question Embeddings for Knowledge Tracing. Applied Sciences, 15(7), 3654. https://doi.org/10.3390/app15073654

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Pre-Training Multi-Concept Question Embeddings for Knowledge Tracing

Abstract

1. Introduction

2. Related Work

2.1. Knowledge Tracing

2.1.1. Knowledge Tracing Before Deep Learning

2.1.2. Deep Learning-Based Knowledge Tracing

2.2. Question Embedding Representation

2.2.1. Traditional Question Embedding

2.2.2. Pre-Trained Question Embedding

2.3. Graph Convolutional Networks

3. Problem Formulation

3.1. Definition 1: Question–Concept Bipartite Graph

3.2. Definition 2: Question Difficulty

4. Method

4.1. Concept Text Embedding

4.2. Question Embedding

4.3. Embedding Update Based on LightGCN

4.4. Graph Constraint

4.5. Auxiliary Information Constraint

4.6. Joint Optimization

5. Experiments

5.1. Datasets

5.2. Compared Models

5.3. Experimental Setup and Hyperparameters

5.4. Performance

5.5. Question–Concept Matrix

5.6. Ablation Study

6. Conclusions and Future Work

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Acknowledgments

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI