Article

Learning Student Knowledge States from Multi-View Question–Skill Networks

by Jiawei Li 1, Dan Xiang 1, Chunlin Li 2, Shun Mao 1, Yuhuan Chen 1, Miao Sun 1, Wei He 1, Yuanfei Deng 3 and Chengli Sun 1,4,*
1 School of Artificial Intelligence, Guangzhou Maritime University, Guangzhou 510725, China
2 School of Cyberspace Security, University of Science and Technology of China, Hefei 230026, China
3 School of Artificial Intelligence, Guangdong Open University (Guangdong Polytechnic Institute), Guangzhou 510091, China
4 School of Information Engineering, Nanchang Hangkong University, Nanchang 330063, China
* Author to whom correspondence should be addressed.
Symmetry 2025, 17(12), 2073; https://doi.org/10.3390/sym17122073
Submission received: 12 November 2025 / Revised: 28 November 2025 / Accepted: 1 December 2025 / Published: 3 December 2025
(This article belongs to the Section Computer)

Abstract

Accurately modeling students’ evolving knowledge states is a core challenge in intelligent education systems. Existing knowledge tracing methods often rely on single-view representations, limiting their ability to capture high-order dependencies among questions and underlying skills. In this paper, we propose a Multi-View Question–Skill Network (MVQSN), which constructs complementary relational views from readily available data to model student knowledge states comprehensively. Specifically, we first build a question–skill bipartite graph capturing direct question–skill associations, a skill complementarity view where edges represent synergistic relationships between skills derived from co-occurrence patterns, and a question co-response view that captures latent question–question associations emerging from aggregated student responses. Each view is independently encoded using graph neural networks, producing view-specific student knowledge embeddings. To enhance representation consistency across views while preserving unique view information, we employ a multi-view contrastive learning module that aligns embeddings from different views. The aligned embeddings are then fused through an attention-guided mechanism, generating a unified and expressive student knowledge state representation. Finally, the unified embeddings are fed into a prediction layer to estimate students’ future performance. We conduct extensive experiments on three publicly available datasets, demonstrating that MVQSN significantly outperforms state-of-the-art baselines in prediction performance.

1. Introduction

Accurately understanding and modeling students’ evolving knowledge states is a fundamental challenge in intelligent education systems [1]. With the rapid growth of online learning platforms and digital tutoring systems, massive amounts of student interaction data, such as response logs and problem-solving histories, have become available [2]. Effectively leveraging these data to provide personalized learning experiences, adaptive feedback, and early intervention strategies requires reliable models that can capture the dynamic learning process of each student [3]. Knowledge tracing (KT) aims to address this need by predicting a student’s future performance based on their historical interactions, thereby offering insights into their mastery of underlying concepts or skills [4].
KT has seen considerable development in both traditional probabilistic models and modern deep learning-based approaches. Early methods, such as Bayesian Knowledge Tracing (BKT) [3], represent students’ latent mastery states as binary variables and update the probability of skill mastery based on sequential responses. While BKT provides a clear probabilistic framework, it often struggles with complex patterns due to strong independence assumptions. Deep learning-based models, such as Deep Knowledge Tracing [5], employ recurrent neural networks [6,7] to capture sequential dependencies in student interactions, significantly improving predictive accuracy. Attention-based models further refine this process by focusing on contextual dependencies among questions and skills [8,9,10,11]. Graph-based methods exploit relational structures between questions and skills to enhance knowledge state representations, providing richer modeling of the underlying learning processes [7,12,13,14].
Despite these advances, most existing methods predominantly rely on single-view representations derived solely from observed question–skill interactions [15]. This reliance limits the model’s ability to capture hierarchical relationships among skills, complementary skill interactions, or latent question–question associations that emerge from aggregated student behavior. Furthermore, single-view approaches often struggle to disentangle different aspects of student knowledge, potentially reducing interpretability and the precision of future performance predictions [16]. These limitations motivate the need for more comprehensive representations that integrate multiple relational views to better capture the complexity of students’ evolving knowledge states.
To address the aforementioned limitations of existing knowledge tracing approaches, we propose a novel Multi-View Question–Skill Network (MVQSN) that comprehensively models students’ evolving knowledge states from multiple relational perspectives. Unlike traditional single-view methods, MVQSN constructs three complementary views: a question–skill bipartite graph capturing direct question–skill associations, a skill complementarity view modeling synergistic relationships among skills derived from co-occurrence patterns, and a question co-response view that uncovers latent question–question dependencies by aggregating student response behaviors. Each view is independently encoded with graph neural networks to obtain view-specific embeddings, which are then aligned through a multi-view contrastive learning module to preserve consistency across views while retaining unique information. Finally, an attention-guided fusion mechanism integrates the aligned embeddings into a unified representation, which is fed into a prediction layer to estimate students’ future performance. Extensive experiments on three publicly available datasets demonstrate that MVQSN significantly outperforms state-of-the-art baselines, validating the effectiveness of multi-view modeling and cross-view contrastive learning in capturing nuanced student knowledge dynamics.
The main contributions of this work are summarized as follows:
  • We introduce three novel relational views for knowledge tracing: a question–skill bipartite graph, a skill complementarity view capturing synergistic skill relationships, and a question co-response view modeling latent question associations from aggregated student answer patterns.
  • We design a cross-view contrastive learning module and an attention-guided fusion mechanism to produce unified and expressive knowledge embeddings.
  • Extensive experiments on three publicly available datasets demonstrate the superior predictive performance of MVQSN compared to state-of-the-art baselines.
The remainder of this paper is organized as follows: Section 2 details the proposed MVQSN model. Section 3 presents experimental results, including ablation studies and case analyses. Finally, Section 4 concludes the paper and discusses future directions.

2. Methods

2.1. Problem Definition

In the domain of intelligent education systems, KT aims to model the evolving knowledge states of students based on their historical interactions with educational content. Let $S = \{s_1, s_2, \ldots, s_N\}$ denote the set of students, $Q = \{q_1, q_2, \ldots, q_M\}$ the set of questions, and $K = \{k_1, k_2, \ldots, k_L\}$ the set of underlying skills. Each student $s_i$ has a sequence of interactions $R_i = \{(q_j, y_{ij})\}$, where $y_{ij} \in \{0, 1\}$ indicates whether $s_i$ answered $q_j$ correctly. Each question $q_j$ is associated with a subset of skills $K_j \subseteq K$.
The goal of KT is to predict the probability that a student $s_i$ will correctly answer a target question $q_j$, given their interaction history:
$$\hat{y}_{ij} = P(y_{ij} = 1 \mid R_i).$$
This task is non-trivial due to several factors: (1) heterogeneous student learning behaviors, (2) high-order dependencies among skills, and (3) sparse and noisy observation sequences. Traditional models rely on single-view representations that fail to capture such multifaceted relationships. To overcome these limitations, we propose the MVQSN model.

2.2. Overview of MVQSN

To address these challenges, we propose MVQSN, a framework that integrates multiple complementary relational views to generate rich, high-dimensional representations of student knowledge states. As shown in Figure 1, MVQSN is designed to capture both direct question–skill relationships and higher-order correlations, including latent skill dependencies and question co-response patterns.
Unlike traditional knowledge tracing methods that embed questions or skills in isolation, MVQSN explicitly constructs three interrelated graph views: (1) a question–skill bipartite graph representing direct associations; (2) a skill complementarity graph capturing synergistic dependencies among skills; and (3) a question co-response graph, which encodes the latent relationships between questions inferred from aggregated student answer patterns. Each graph is independently encoded via a graph neural network (GNN) to generate node embeddings that preserve both local neighborhood structure and global relational information.
Recognizing that different views capture complementary signals, MVQSN employs a cross-view contrastive learning module, which aligns embeddings of the same student–question interaction across views while retaining view-specific characteristics. This module enhances consistency among views and facilitates the propagation of high-order knowledge dependencies. Subsequently, an attention-guided fusion mechanism integrates embeddings from all views into a unified student knowledge state vector. This fused representation is then passed through a prediction layer to estimate the probability of a correct response.

2.3. Multi-View Construction

Unlike prior knowledge tracing methods that focus primarily on direct question–skill associations or pre-defined skill hierarchies, MVQSN introduces two novel views to capture additional relational information. The skill complementarity view models synergistic relationships among skills derived from co-occurrence patterns in student responses, enabling the discovery of latent skill interactions. The question co-response view aggregates student response behaviors to uncover hidden question–question dependencies, capturing cognitive and behavioral patterns that are not reflected in simple question–skill mappings or sequential co-occurrences. Together, these views complement the traditional question–skill bipartite graph, providing a richer, multi-view representation of student knowledge states.

2.3.1. Question–Skill Bipartite Graph

A key component of the MVQSN framework is the question–skill bipartite graph, which explicitly captures the relationships between assessment questions and their associated skills. This graph forms the structural backbone of our model, directly linking observable student responses to underlying cognitive factors and enabling effective propagation of information across questions and skills. By modeling these explicit associations, the graph allows the model to leverage known skill structures to improve the estimation of student knowledge states and to capture high-order interactions between questions and skills. The bipartite graph is formally defined as
$$G_{qk} = (V_{qk}, E_{qk}),$$
where $V_{qk} = Q \cup K$ and $E_{qk} \subseteq Q \times K$ represents the observed associations between questions and the skills they assess.
The relationship between questions and skills is represented by an adjacency matrix $A_{qk} \in \{0, 1\}^{N_q \times N_k}$, where
$$A_{qk}(i, j) = \begin{cases} 1, & \text{if question } q_i \text{ requires skill } k_j, \\ 0, & \text{otherwise.} \end{cases}$$
This matrix can be obtained directly from the metadata of educational datasets such as ASSISTments or EdNet, where the mapping between items and skills is explicitly defined by domain experts or curriculum standards.
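For concreteness, the construction of this binary adjacency from dataset metadata could be sketched as follows; the `question_skills` dictionary format and the use of SciPy sparse matrices are illustrative assumptions rather than part of the datasets' released tooling.

```python
import numpy as np
from scipy.sparse import coo_matrix

def build_question_skill_adjacency(question_skills, num_questions, num_skills):
    """Build the binary question-skill adjacency A_qk from a
    {question_id: [skill_id, ...]} mapping (hypothetical metadata format)."""
    rows, cols = [], []
    for q, skills in question_skills.items():
        for k in skills:
            rows.append(q)
            cols.append(k)
    data = np.ones(len(rows), dtype=np.float32)
    return coo_matrix((data, (rows, cols)),
                      shape=(num_questions, num_skills)).tocsr()

# Toy example: 3 questions, 2 skills.
A_qk = build_question_skill_adjacency({0: [0], 1: [0, 1], 2: [1]}, 3, 2)
print(A_qk.toarray())
```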
Each question and skill node is associated with an initial feature vector. The initial embeddings are defined as:
$$H_q^{(0)} = \big[h_{q_1}^{(0)}, h_{q_2}^{(0)}, \ldots, h_{q_{N_q}}^{(0)}\big] \in \mathbb{R}^{N_q \times d},$$
$$H_k^{(0)} = \big[h_{k_1}^{(0)}, h_{k_2}^{(0)}, \ldots, h_{k_{N_k}}^{(0)}\big] \in \mathbb{R}^{N_k \times d},$$
where $d$ denotes the embedding dimension. These embeddings can be initialized using one-hot encodings, pre-trained text embeddings, or randomly sampled vectors.
To propagate and aggregate information across the bipartite structure, we employ a graph convolutional network (GCN) [14]. Since G q k is bipartite, message passing alternates between the question and skill domains. The propagation rules for layer l are given by:
$$H_q^{(l+1)} = \sigma\big(\tilde{A}_{qk} H_k^{(l)} W_k^{(l)}\big),$$
$$H_k^{(l+1)} = \sigma\big(\tilde{A}_{qk}^{\top} H_q^{(l)} W_q^{(l)}\big),$$
where $\tilde{A}_{qk}$ is the symmetrically normalized adjacency matrix defined as
$$\tilde{A}_{qk} = D_q^{-\frac{1}{2}} A_{qk} D_k^{-\frac{1}{2}},$$
where $D_q$ and $D_k$ are diagonal degree matrices for questions and skills, respectively; $W_k^{(l)}$ and $W_q^{(l)}$ are trainable weight matrices; and $\sigma(\cdot)$ denotes a nonlinear activation function such as LeakyReLU.
Through this alternating propagation, the embedding of each question h q i encodes not only its direct skill associations but also indirect relations to other questions sharing similar skill dependencies. Conversely, the embedding of each skill h k j captures the aggregated characteristics of questions assessing that skill. This bidirectional propagation enables each question to integrate information from both directly associated skills and skills reached through multi-hop links, while each skill embedding summarizes the shared characteristics of all questions evaluating that skill. As a result, the model captures structural dependencies defined by the bipartite topology and semantic dependencies arising from the skill–question attribute patterns, forming a more detailed interaction space between questions and skills.
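A minimal PyTorch sketch of one alternating propagation step is given below; the layer class is an illustrative assumption (not the released implementation), and the skill update uses the transpose of the normalized adjacency so that the dimensions match.

```python
import torch
import torch.nn as nn

class BipartiteGCNLayer(nn.Module):
    """One alternating question <-> skill propagation step over A_qk."""
    def __init__(self, dim):
        super().__init__()
        self.W_k = nn.Linear(dim, dim, bias=False)   # transforms skill messages
        self.W_q = nn.Linear(dim, dim, bias=False)   # transforms question messages
        self.act = nn.LeakyReLU(0.2)

    def forward(self, A_qk, H_q, H_k):
        # Symmetric normalization D_q^{-1/2} A_qk D_k^{-1/2}.
        d_q = A_qk.sum(dim=1).clamp(min=1.0)
        d_k = A_qk.sum(dim=0).clamp(min=1.0)
        A_norm = A_qk / torch.sqrt(d_q).unsqueeze(1) / torch.sqrt(d_k).unsqueeze(0)
        H_q_next = self.act(A_norm @ self.W_k(H_k))       # skills -> questions
        H_k_next = self.act(A_norm.t() @ self.W_q(H_q))   # questions -> skills
        return H_q_next, H_k_next
```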
After $L$ layers of propagation, the final embeddings $H_q^{(L)}$ and $H_k^{(L)}$ represent the contextualized question and skill representations in the question–skill view:
$$Z_{qk} = \big[H_q^{(L)}; H_k^{(L)}\big],$$
which serve as the first-view representations for subsequent cross-view contrastive learning and knowledge state modeling.

2.3.2. Skill Complementarity View

While the question–skill bipartite graph captures the direct dependencies between questions and their corresponding skills, it does not reflect the implicit inter-skill relations emerging from students’ learning trajectories. In real-world learning processes, certain skills often exhibit complementary relationships, where mastery of one skill facilitates understanding of another, even if the two are not explicitly linked in curriculum metadata. To capture these latent dependencies, we construct a skill complementarity view $G_{kc}$, which encodes co-occurrence and mutual enhancement relationships among skills derived from aggregated student interaction data. This view allows the model to leverage patterns in actual learning behaviors, improving the estimation of students’ evolving skill mastery and enabling more accurate predictions of future performance.
Let $K = \{k_1, k_2, \ldots, k_{N_k}\}$ denote the skill set and let $G_{kc} = (K, E_{kc})$ represent the undirected complementarity graph, where each edge $(k_i, k_j) \in E_{kc}$ indicates a statistically significant complementarity relation between $k_i$ and $k_j$. Such relations are inferred from students’ historical performance patterns.
Formally, the adjacency matrix $A_{kc} \in \mathbb{R}^{N_k \times N_k}$ is constructed based on the co-occurrence of skills within the same student sessions. Let $R_u = \{(q_t, r_t)\}_{t=1}^{T_u}$ denote the sequence of question–response pairs for student $u$ and $K_{q_t}$ be the set of skills associated with question $q_t$. The empirical co-occurrence count between two skills $k_i$ and $k_j$ is given by
$$c_{ij} = \sum_{u} \sum_{t=1}^{T_u} \mathbb{I}(k_i \in K_{q_t}) \cdot \mathbb{I}(k_j \in K_{q_t}),$$
where $\mathbb{I}(\cdot)$ is an indicator function.
To mitigate the influence of skill frequency imbalance, we normalize co-occurrence counts using pointwise mutual information (PMI):
$$\mathrm{PMI}(k_i, k_j) = \log \frac{p(k_i, k_j)}{p(k_i)\, p(k_j)} = \log \frac{c_{ij} \cdot C}{c_i\, c_j},$$
where $C = \sum_{i,j} c_{ij}$ is the total number of co-occurrence events and $c_i = \sum_j c_{ij}$. Positive PMI values indicate that the co-occurrence of $(k_i, k_j)$ is higher than expected by chance, suggesting a complementary relation. The weighted adjacency is thus defined as
$$A_{kc}(i, j) = \max\big(0, \mathrm{PMI}(k_i, k_j)\big).$$
To prevent the graph from becoming overly dense due to noisy statistical correlations, we apply a sparsification step by retaining the top-$m$ complementary edges for each skill node:
$$A_{kc}(i, j) = 0, \quad \text{if } A_{kc}(i, j) \text{ is not among the top-}m \text{ values in row } i.$$
In addition, semantic similarity derived from textual skill descriptions (e.g., cosine similarity of skill embeddings) can be incorporated to refine $A_{kc}$ via a convex combination:
$$\hat{A}_{kc} = \lambda A_{kc} + (1 - \lambda) K_{sem},$$
where $K_{sem}$ is a semantic similarity matrix and $\lambda \in [0, 1]$ balances behavioral and semantic complementarity.
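The PMI weighting, top-$m$ sparsification, and optional semantic blend described above could be assembled as in the following NumPy sketch; the function name and dense-matrix implementation are assumptions made for readability.

```python
import numpy as np

def pmi_adjacency(cooccur, top_m=10, semantic=None, lam=1.0):
    """Skill complementarity adjacency from a skill co-occurrence count matrix."""
    c = cooccur.astype(np.float64)
    total = c.sum()
    row = c.sum(axis=1, keepdims=True)
    col = c.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(c * total / (row * col))
    A = np.nan_to_num(np.maximum(pmi, 0.0))   # keep only positive PMI, drop nan/-inf
    # Top-m sparsification per skill (row).
    if top_m < A.shape[1]:
        thresh = np.sort(A, axis=1)[:, -top_m][:, None]
        A[A < thresh] = 0.0
    # Optional convex blend with a semantic similarity matrix K_sem.
    if semantic is not None:
        A = lam * A + (1.0 - lam) * semantic
    return A
```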
The skill complementarity view is encoded using a Graph Attention Network (GAT) to capture high-order relational signals among skills. Given initial skill embeddings $H_k^{(0)}$ (shared with the question–skill bipartite view), message propagation in layer $l$ is defined as:
$$H_k^{(l+1)} = \sigma\big(\tilde{A}_{kc} H_k^{(l)} W_{kc}^{(l)}\big),$$
where $\tilde{A}_{kc} = D_k^{-\frac{1}{2}} \hat{A}_{kc} D_k^{-\frac{1}{2}}$ is the normalized adjacency matrix, $W_{kc}^{(l)}$ is a layer-specific weight matrix, and $\sigma(\cdot)$ denotes a nonlinear activation function.
For GAT-based propagation, attention coefficients are computed as:
$$\alpha_{ij} = \frac{\exp\big(\mathrm{LeakyReLU}\big(a^{\top}[W h_{k_i} \,\|\, W h_{k_j}]\big)\big)}{\sum_{n \in \mathcal{N}(i)} \exp\big(\mathrm{LeakyReLU}\big(a^{\top}[W h_{k_i} \,\|\, W h_{k_n}]\big)\big)},$$
and node updates are aggregated as:
$$h_{k_i}^{(l+1)} = \sigma\Big(\sum_{j \in \mathcal{N}(i)} \alpha_{ij} W h_{k_j}^{(l)}\Big).$$
This view allows the model to capture higher-order skill dependencies that cannot be observed from the explicit question–skill mapping. By learning from the complementarity structure, the embeddings of skills are enriched with contextual cues about which other skills co-occur or jointly contribute to successful performance. Consequently, this view enhances the representational expressiveness of the knowledge tracing model, enabling it to generalize across tasks requiring different yet synergistic skill combinations.
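For reference, the attention and update rules above correspond to a standard single-head GAT layer, which might be sketched as follows (an illustrative assumption; multi-head attention and dropout are omitted, and `adj` is a dense 0/1 neighbor mask).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleGATLayer(nn.Module):
    """Single-head GAT layer restricted to edges present in a dense 0/1 mask."""
    def __init__(self, dim):
        super().__init__()
        self.W = nn.Linear(dim, dim, bias=False)
        self.a = nn.Linear(2 * dim, 1, bias=False)

    def forward(self, H, adj):
        Wh = self.W(H)                                          # [N, d]
        N = Wh.size(0)
        # Build [Wh_i || Wh_j] for every node pair, then score with a.
        pairs = torch.cat([Wh.unsqueeze(1).expand(-1, N, -1),
                           Wh.unsqueeze(0).expand(N, -1, -1)], dim=-1)
        e = F.leaky_relu(self.a(pairs).squeeze(-1), 0.2)        # [N, N] logits
        e = e.masked_fill(adj == 0, float("-inf"))              # keep neighbors only
        alpha = torch.nan_to_num(torch.softmax(e, dim=1))       # isolated rows -> 0
        return torch.elu(alpha @ Wh)                            # aggregate neighbors
```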
After $L$ propagation layers, the learned skill embeddings $H_k^{(L)}$ constitute the output of the skill complementarity view, denoted as:
$$Z_{kc} = H_k^{(L)}.$$
These embeddings are subsequently aligned with those from other views via multi-view contrastive learning, as described in Section 2.4.

2.3.3. Question Co-Response View

The third view in the MVQSN framework, termed the question co-response view, aims to uncover latent dependencies among questions that emerge from students’ collective answering behaviors. Unlike the explicit question–skill mapping or the inferred skill complementarity structure, this view directly models empirical relationships between questions as observed through students’ response patterns. The underlying assumption is that questions eliciting similar response distributions across a large student population are likely to require related cognitive abilities or share similar difficulty characteristics, even if they are mapped to distinct skills. By capturing these implicit co-response patterns, this view enables the model to exploit hidden correlations between questions, providing complementary information that enhances the richness and expressiveness of the learned student knowledge representations.
Let $S = \{s_1, s_2, \ldots, s_{N_s}\}$ denote the set of students and $Q = \{q_1, q_2, \ldots, q_{N_q}\}$ the set of questions. We define the student–question response matrix $R \in [0, 1]^{N_s \times N_q}$, where
$$R_{s,q} = \begin{cases} r_{s,q}, & \text{if student } s \text{ has attempted question } q, \\ 0, & \text{otherwise,} \end{cases}$$
and $r_{s,q} \in \{0, 1\}$ indicates whether the student’s response is correct (1) or incorrect (0). Missing entries correspond to unattempted questions and are treated as zeros or imputed with the student’s average accuracy, depending on dataset sparsity.
To quantify the behavioral similarity between two questions $q_i$ and $q_j$, we analyze the joint distribution of students’ responses on these questions. A natural measure is the Pearson correlation coefficient [17]:
$$\rho(q_i, q_j) = \frac{\sum_{s} (R_{s,i} - \bar{r}_i)(R_{s,j} - \bar{r}_j)}{\sqrt{\sum_{s} (R_{s,i} - \bar{r}_i)^2} \, \sqrt{\sum_{s} (R_{s,j} - \bar{r}_j)^2}},$$
where $\bar{r}_i$ and $\bar{r}_j$ denote the average correctness of questions $q_i$ and $q_j$, respectively. The correlation reflects the consistency of students’ performance across questions, thereby revealing latent relationships beyond explicit skill annotations.
However, the Pearson correlation may be sensitive to skewed response distributions. To improve robustness, we adopt a mutual information-based dependency measure:
$$\mathrm{MI}(q_i, q_j) = \sum_{x \in \{0,1\}} \sum_{y \in \{0,1\}} p_{ij}(x, y) \log \frac{p_{ij}(x, y)}{p_i(x)\, p_j(y)},$$
where $p_{ij}(x, y)$ is the joint probability of responses $(x, y)$ over all students and $p_i(x)$, $p_j(y)$ are the corresponding marginals. Unlike correlation, which captures only linear covariation between two response vectors, mutual information reflects any change in the full joint response distribution. When two questions exhibit asymmetric error patterns (e.g., students often miss $q_j$ only when they miss $q_i$), conditional relationships (e.g., $q_j$ is difficult only after failing $q_i$), or threshold-like mastery behaviors, these effects do not create linear trends but do alter the joint probability $p_{ij}(x, y)$. MI increases whenever such nonlinear co-response patterns appear, allowing it to capture question dependencies that correlation-based measures would overlook.
We define the adjacency matrix $A_{qq} \in \mathbb{R}^{N_q \times N_q}$ for the question co-response view as:
$$A_{qq}(i, j) = \begin{cases} \mathrm{MI}(q_i, q_j), & \text{if } i \neq j, \\ 0, & \text{otherwise.} \end{cases}$$
To avoid noise from spurious associations, a sparsification step is applied, retaining only the top-$k$ connections for each question based on $A_{qq}(i, j)$. Furthermore, the adjacency matrix is normalized as:
$$\tilde{A}_{qq} = D_q^{-\frac{1}{2}} A_{qq} D_q^{-\frac{1}{2}},$$
where $D_q$ is the diagonal degree matrix with entries $D_q(i, i) = \sum_j A_{qq}(i, j)$.
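A NumPy sketch of the mutual-information adjacency, its top-$k$ sparsification, and the symmetric normalization might look as follows; the dense 0/1 response matrix and the helper names are illustrative simplifications of the sparse interaction logs in the real datasets.

```python
import numpy as np

def mi_pair(x, y):
    """Mutual information (nats) between two binary response vectors."""
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            p_xy = np.mean((x == a) & (y == b))
            p_x, p_y = np.mean(x == a), np.mean(y == b)
            if p_xy > 0:
                mi += p_xy * np.log(p_xy / (p_x * p_y))
    return mi

def coresponse_adjacency(R, top_k=10):
    """A_qq with MI weights, top-k sparsification, and symmetric normalization."""
    n_q = R.shape[1]
    A = np.zeros((n_q, n_q))
    for i in range(n_q):
        for j in range(i + 1, n_q):
            A[i, j] = A[j, i] = mi_pair(R[:, i], R[:, j])
    if top_k < n_q:                                  # keep top-k links per question
        thresh = np.sort(A, axis=1)[:, -top_k][:, None]
        A[A < thresh] = 0.0
    d = np.maximum(A.sum(axis=1), 1e-8)              # degree, guarded against zeros
    return A / np.sqrt(d)[:, None] / np.sqrt(d)[None, :]
```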
Given initial question embeddings $H_q^{(0)}$ shared with the question–skill bipartite view, we perform message propagation over the question co-response view using a GAT layer, as in the skill complementarity view. The question co-response view offers a complementary behavioral perspective on question relationships, allowing the model to learn from empirical evidence on how students jointly respond to different items. Unlike handcrafted skill tags, which may be incomplete or inconsistent across datasets, this view leverages aggregated student performance to infer latent cognitive linkages among questions. It thus enriches the question representations with behavior-driven dependencies that more accurately reflect actual learning dynamics.
After $L$ layers of propagation, we obtain the final question embeddings in this view:
$$Z_{qq} = H_q^{(L)},$$
which serve as the question-level representations for the question co-response view. These embeddings are aligned with those from other views via the multi-view contrastive learning framework described in Section 2.4, thereby contributing to a holistic and consistent modeling of student knowledge states.

2.4. Cross-View Contrastive Learning

The MVQSN model is designed to integrate heterogeneous relational information from multiple perspectives. However, embeddings obtained from different graph views may reside in distinct latent spaces and encode view-specific biases, which can hinder their effective fusion. To address this issue, we introduce a cross-view contrastive learning (CVCL) module that aligns representations across views while preserving complementary information intrinsic to each view. This mechanism encourages consistency in semantically shared representations (e.g., the same question or skill across views) and simultaneously maintains discriminative capacity for unique relational structures.
Given the three graph views, namely the question–skill bipartite graph $G_{qk}$, the skill complementarity view $G_{kc}$, and the question co-response view $G_{qq}$, each encoder produces view-specific node embeddings:
$$Z_{qk}, \; Z_{kc}, \; Z_{qq} \in \mathbb{R}^{N \times d},$$
where $N$ denotes the number of nodes (questions or skills) and $d$ is the embedding dimension. Although these embeddings capture different aspects of educational interaction, they must ultimately describe the same underlying entities (e.g., the same question appearing in multiple views). The CVCL module is therefore introduced to bridge these latent spaces by maximizing cross-view agreement while minimizing noise from view-specific perturbations.
Before computing the contrastive loss, each embedding vector is projected into a common latent subspace and normalized:
$$z_i^{(v)} = \frac{W_v h_i^{(v)}}{\big\| W_v h_i^{(v)} \big\|_2}, \quad v \in \{qk, kc, qq\},$$
where $W_v \in \mathbb{R}^{d' \times d}$ is a learnable projection head and $d'$ is the contrastive embedding dimension. The normalization ensures that similarity comparisons are conducted on the unit hypersphere, stabilizing optimization.
For each entity $i$ (e.g., a question node), we treat its embeddings across different views as positive pairs, since they represent the same semantic object under distinct relational perspectives. Formally, for two views $v_1$ and $v_2$, the positive pair is $(z_i^{(v_1)}, z_i^{(v_2)})$. Embeddings of different entities $(z_i^{(v_1)}, z_j^{(v_2)})$ with $i \neq j$ are treated as negative pairs. To stabilize training and increase diversity, negatives are drawn from the entire mini-batch.
We adopt an InfoNCE-based contrastive objective to maximize mutual information between views. For each positive pair $(z_i^{(v_1)}, z_i^{(v_2)})$, the loss is defined as:
$$\mathcal{L}_{v_1, v_2} = -\sum_{i=1}^{N} \log \frac{\exp\big(\mathrm{sim}(z_i^{(v_1)}, z_i^{(v_2)}) / \tau\big)}{\sum_{j=1}^{N} \exp\big(\mathrm{sim}(z_i^{(v_1)}, z_j^{(v_2)}) / \tau\big)},$$
where $\mathrm{sim}(z_a, z_b) = z_a^{\top} z_b$ denotes cosine similarity (as the embeddings are $\ell_2$-normalized) and $\tau > 0$ is a temperature hyperparameter controlling distribution sharpness. The overall cross-view contrastive loss aggregates all view pairs:
$$\mathcal{L}_{\mathrm{CVCL}} = \mathcal{L}_{qk, kc} + \mathcal{L}_{qk, qq} + \mathcal{L}_{kc, qq}.$$
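A compact PyTorch sketch of the pairwise InfoNCE term above is shown below (an assumed implementation, not the authors' released code); it takes two embedding matrices whose rows index the same entities in two different views.

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, tau=0.2):
    """Cross-view InfoNCE: row i of z1 and row i of z2 form a positive pair."""
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / tau                     # [N, N] cosine similarities / tau
    labels = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, labels)         # -log softmax on the diagonal

# L_CVCL aggregates the three view pairs, e.g.:
# loss_cvcl = info_nce(Z_qk, Z_kc) + info_nce(Z_qk, Z_qq) + info_nce(Z_kc, Z_qq)
```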
To further align structural semantics, we introduce a consistency regularization term that penalizes embedding discrepancy for the same node across views:
$$\mathcal{L}_{\mathrm{align}} = \frac{1}{N} \sum_{i=1}^{N} \sum_{v_1 \neq v_2} \big\| z_i^{(v_1)} - z_i^{(v_2)} \big\|_2^2.$$
This term ensures that the same question or skill remains close in the joint embedding space while retaining diversity through the InfoNCE loss.
Pure alignment may lead to over-smoothing, where embeddings from different views collapse to identical representations. To counter this, we introduce a decorrelation-based term encouraging orthogonality between view-specific components:
$$\mathcal{L}_{\mathrm{decor}} = \sum_{v_1 \neq v_2} \big\| C_{v_1, v_2} - I \big\|_F^2,$$
where $C_{v_1, v_2}$ denotes the cross-correlation matrix between the embeddings of two views and $\| \cdot \|_F$ is the Frobenius norm. This term ensures that each view contributes distinct information rather than redundant signals.
The overall contrastive learning objective for MVQSN combines the above components:
$$\mathcal{L}_{\mathrm{MVCL}} = \mathcal{L}_{\mathrm{CVCL}} + \alpha \mathcal{L}_{\mathrm{align}} + \beta \mathcal{L}_{\mathrm{decor}},$$
where $\alpha$ and $\beta$ are balancing hyperparameters controlling the trade-off between alignment and diversity. The resulting embeddings $\{Z_{qk}, Z_{kc}, Z_{qq}\}$ form a consistent yet complementary representation space across views. The aligned embeddings are subsequently passed to the fusion module (Section 2.5) for student performance prediction.
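Under the same assumptions as the InfoNCE sketch above, the auxiliary alignment and decorrelation terms and their combination could be written as:

```python
import torch

def align_loss(views):
    """Mean squared distance between embeddings of the same node across views."""
    loss, n = 0.0, views[0].size(0)
    for i in range(len(views)):
        for j in range(i + 1, len(views)):
            loss = loss + ((views[i] - views[j]) ** 2).sum() / n
    return loss

def decor_loss(views):
    """Push cross-view correlation matrices toward the identity."""
    loss = 0.0
    for i in range(len(views)):
        for j in range(i + 1, len(views)):
            zi = (views[i] - views[i].mean(0)) / (views[i].std(0) + 1e-8)
            zj = (views[j] - views[j].mean(0)) / (views[j].std(0) + 1e-8)
            C = zi.t() @ zj / zi.size(0)
            loss = loss + ((C - torch.eye(C.size(0), device=C.device)) ** 2).sum()
    return loss

# L_MVCL = L_CVCL + alpha * align_loss(views) + beta * decor_loss(views)
```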
By aligning multiple relational perspectives through contrastive objectives, CVCL ensures that the learned embeddings capture consistent semantic structures across views while retaining their respective relational nuances. This process enhances both the generalization and interpretability of the student’s knowledge representations, providing a unified yet diversified embedding space that faithfully reflects the multifaceted nature of learning interactions.

2.5. Student Performance Prediction

After obtaining the cross-view aligned representations from the MVQSN, the final objective is to predict a student’s probability of correctly answering the next question. This process, termed student performance prediction (SPP), leverages the fused multi-view embeddings of questions and skills, along with the temporal context of each student’s interaction history. The SPP module bridges the relational representation learning stage and the sequential knowledge tracing task, thereby forming the predictive core of MVQSN.
Student responses are inherently sequential and context-dependent: their likelihood of answering a question correctly depends on both the latent knowledge state accumulated from past interactions and the difficulty and skill requirements of the current question. Thus, while the multi-view encoders and cross-view contrastive learning ensure a semantically rich representation space, these embeddings must be temporally contextualized through a sequence modeling mechanism to enable accurate knowledge tracing and response prediction.
Let $z_{qk,i}$, $z_{kc,i}$, and $z_{qq,i}$ denote the aligned embeddings of the $i$-th question obtained from the three graph views (question–skill, skill complementarity, and question co-response). To integrate these heterogeneous signals into a unified representation, we employ an adaptive attention-based fusion mechanism that dynamically adjusts the contribution of each view according to the question’s relational importance.
Formally, the attention weights are computed as:
$$\alpha_i^{(v)} = \frac{\exp\big(u^{\top} \tanh(W_a z_i^{(v)} + b_a)\big)}{\sum_{v' \in \{qk, kc, qq\}} \exp\big(u^{\top} \tanh(W_a z_i^{(v')} + b_a)\big)},$$
where $u$, $W_a$, and $b_a$ are learnable parameters. The fused question representation is then obtained by weighted aggregation:
$$z_i = \sum_{v \in \{qk, kc, qq\}} \alpha_i^{(v)} z_i^{(v)}.$$
This mechanism adaptively emphasizes the most informative view for each question, e.g., prioritizing $G_{qk}$ for skill-centric questions or $G_{qq}$ for questions with strong inter-question dependencies.
To further prevent over-dominance of a single view, we introduce a gated residual term:
$$z_i \leftarrow z_i + \sigma\big(W_g [z_{qk,i}; z_{kc,i}; z_{qq,i}] + b_g\big),$$
where $\sigma(\cdot)$ is the sigmoid function and $[\,\cdot\,;\,\cdot\,]$ denotes vector concatenation. This gating operation preserves view-specific details while maintaining global semantic consistency.
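The attention-guided fusion and gated residual above can be sketched in PyTorch as follows; the module name and layer shapes are assumptions chosen to mirror the equations, not the authors' released code.

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Attention-weighted fusion of per-view embeddings with a gated residual."""
    def __init__(self, dim, num_views=3):
        super().__init__()
        self.W_a = nn.Linear(dim, dim)
        self.u = nn.Linear(dim, 1, bias=False)
        self.gate = nn.Linear(num_views * dim, dim)

    def forward(self, views):                            # list of [N, d] tensors
        stacked = torch.stack(views, dim=1)              # [N, V, d]
        scores = self.u(torch.tanh(self.W_a(stacked)))   # [N, V, 1]
        alpha = torch.softmax(scores, dim=1)             # per-view attention weights
        fused = (alpha * stacked).sum(dim=1)             # weighted aggregation
        gate = torch.sigmoid(self.gate(torch.cat(views, dim=-1)))
        return fused + gate                              # gated residual term
```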
Each student s is represented by a chronological sequence of interactions:
$$S_s = \{(q_1, r_1), (q_2, r_2), \ldots, (q_T, r_T)\},$$
where $q_t$ is the question index and $r_t \in \{0, 1\}$ denotes the correctness label. We construct the input embedding at timestep $t$ as:
$$x_t = [z_{q_t} \,\|\, e_{r_t}],$$
where $e_{r_t}$ is a correctness embedding that encodes whether the student answered correctly or incorrectly.
To model temporal dependencies, we adopt a transformer-based encoder [18,19] with self-attention [20] layers that capture long-range dependencies between past interactions. Given the input sequence $X = [x_1, x_2, \ldots, x_T]$, the encoder produces contextualized hidden states:
$$H = \mathrm{TransformerEncoder}(X) = [h_1, h_2, \ldots, h_T],$$
where each $h_t$ represents the student’s latent knowledge state at timestep $t$, integrating both historical performance and question semantics.
For each interaction $(q_t, r_t)$, we aim to predict the probability $\hat{r}_{t+1}$ that the student correctly answers the next question $q_{t+1}$. This is achieved by combining the student’s latent knowledge representation $h_t$ and the fused embedding $z_{q_{t+1}}$ of the target question:
$$\hat{r}_{t+1} = \sigma\big(W_p [h_t; z_{q_{t+1}}] + b_p\big),$$
where $\sigma(\cdot)$ denotes the sigmoid activation and $W_p$ and $b_p$ are learnable parameters.
The student performance prediction module is trained using the binary cross-entropy loss between predicted and observed responses:
$$\mathcal{L}_{\mathrm{pred}} = -\frac{1}{N_s} \sum_{s=1}^{N_s} \sum_{t=1}^{T_s - 1} \Big[ r_{t+1} \log \hat{r}_{t+1} + (1 - r_{t+1}) \log(1 - \hat{r}_{t+1}) \Big],$$
where $N_s$ is the number of students and $T_s$ is the number of interactions for student $s$.
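For orientation, a minimal sketch of the sequence encoder, prediction head, and binary cross-entropy loss is shown below; the causal mask and module structure are assumptions consistent with the setup described in Section 3.2.

```python
import torch
import torch.nn as nn

class SPPHead(nn.Module):
    """Transformer encoder over interaction embeddings plus a sigmoid prediction head."""
    def __init__(self, dim=64, heads=4, layers=2):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                               batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)
        self.pred = nn.Linear(2 * dim, 1)
        self.bce = nn.BCELoss()

    def forward(self, x, z_next, r_next):
        # x: [B, T, d] interaction embeddings; z_next: [B, T, d] next-question
        # embeddings; r_next: [B, T] correctness labels for the next questions.
        T = x.size(1)
        causal = nn.Transformer.generate_square_subsequent_mask(T).to(x.device)
        h = self.encoder(x, mask=causal)                 # latent knowledge states h_t
        r_hat = torch.sigmoid(self.pred(torch.cat([h, z_next], dim=-1))).squeeze(-1)
        return r_hat, self.bce(r_hat, r_next.float())
```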
The overall loss for MVQSN integrates all training objectives:
$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{pred}} + \lambda \mathcal{L}_{\mathrm{MVCL}},$$
where $\lambda$ is a weighting factor balancing representation alignment and predictive accuracy.
Through adaptive fusion, sequential encoding, and predictive modeling, the student performance prediction module unifies the multi-view relational embeddings into a temporally grounded prediction framework. This design closes the loop between structural knowledge representation and sequential student modeling, enabling MVQSN to achieve high predictive accuracy in knowledge tracing tasks.

3. Experiments

3.1. Datasets

We evaluate the proposed MVQSN on three widely used benchmark datasets for knowledge tracing: ASSIST2009, EdNet-KT1, and EdNet-KT2. These datasets cover a broad spectrum of educational settings in terms of question volume, knowledge granularity, and student population scale, allowing us to comprehensively assess the generalizability of MVQSN. Table 1 summarizes the key statistics. The three datasets are described below:
  • ASSIST2009 (https://sites.google.com/site/assistmentsdata/home/2009-2010-assistment-data, accessed on 1 May 2025) is a classic K-12 mathematics dataset collected from the ASSISTments online tutoring system. Following standard preprocessing protocols, we exclude entries without skill annotations and remove scaffolding items to ensure data consistency.
  • EdNet-KT1 (https://github.com/riiid/ednet, accessed on 1 May 2025) is a large-scale dataset from the Santa learning platform. It contains rich interaction logs between students and questions over multiple sessions. In line with prior work [21], we randomly sample 5000 students for efficient evaluation.
  • EdNet-KT2 (https://github.com/riiid/ednet, accessed on 1 May 2025) extends EdNet-KT1 with additional interaction records collected in a later period. We apply identical filtering and sampling strategies to maintain comparability.
The question–skill mappings provided in these datasets are utilized to construct the question–skill bipartite graph, while the skill complementarity view and the question co-response view are derived from the statistical co-occurrence and aggregated answering patterns, respectively. This ensures that all graph structures are built solely from the available metadata, without requiring any external information.

3.2. Implementation Details

To ensure experimental fairness and reproducibility, all datasets are preprocessed using a consistent pipeline. Each student’s response sequence is segmented to a fixed maximum length of 50 interactions, following the practice of recent models such as AKT and LPKT-S. Sequences shorter than this threshold are zero-padded, while longer sequences are truncated chronologically. Students are randomly divided into training, validation, and testing sets with a 70%, 10%, and 20% ratio, respectively. All reported results are averaged over five independent runs to mitigate random fluctuations.
For hyperparameter settings, the embedding dimension for questions and skills is set to d = 64 . Each graph view encoder contains two GNN layers with residual connections and layer normalization. The weight λ balancing prediction and contrastive losses is empirically set to 0.05. The learning rate and dropout rate are initialized as 0.001 and 0.5, respectively, and optimized using the Adam optimizer. The attention fusion network employs two hidden layers with ReLU activations [2]. The Transformer encoder for temporal modeling contains two layers, four attention heads, and a hidden size of 64. All models are trained on a single NVIDIA RTX 4090 GPU for 100 epochs with early stopping based on validation AUC.
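For convenience, the settings listed above can be gathered into a single configuration; the dictionary below simply restates the values from this subsection and is an organizational convenience, not part of the released code.

```python
CONFIG = {
    "max_seq_len": 50,          # fixed interaction window (padded / truncated)
    "split": (0.7, 0.1, 0.2),   # train / validation / test by student
    "embed_dim": 64,            # question and skill embedding dimension d
    "gnn_layers_per_view": 2,   # with residual connections and layer normalization
    "lambda_mvcl": 0.05,        # weight of the contrastive objective
    "learning_rate": 1e-3,
    "dropout": 0.5,
    "optimizer": "Adam",
    "transformer_layers": 2,
    "attention_heads": 4,
    "hidden_size": 64,
    "epochs": 100,              # early stopping on validation AUC
    "runs": 5,                  # results averaged over independent runs
}
```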
For baseline models, we use their publicly available implementations and tune hyperparameters on the validation set for optimal performance. All experiments are conducted under identical data splits and evaluation protocols.

3.3. Evaluation Metrics

Following standard practice in the knowledge tracing literature, we cast the response prediction task as a binary classification problem. Each student’s response is labeled 1 for correct and 0 for incorrect. We employ four complementary metrics to comprehensively evaluate performance:
  • AUC (Area Under the ROC Curve). Measures the probability that a model ranks a randomly chosen correct response higher than an incorrect one.
  • ACC (Accuracy). Reflects the overall proportion of correctly predicted responses.
  • MAE (Mean Absolute Error). Quantifies the mean absolute deviation between predicted probabilities and actual outcomes.
  • RMSE (Root Mean Square Error). Penalizes large deviations more heavily, capturing the robustness of model predictions.
Higher AUC and ACC values indicate better classification performance, while lower MAE and RMSE values reflect more precise probabilistic estimation. To evaluate statistical reliability, we conduct paired t-tests across all metrics with a 99% confidence level, confirming that observed improvements are statistically significant ( p < 0.01 ).
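The significance test mentioned above can be carried out with a standard paired t-test, e.g. via SciPy; the per-run metric values below are placeholders for illustration only.

```python
from scipy import stats

# AUC of MVQSN vs. the strongest baseline over five runs (placeholder values).
mvqsn_auc    = [0.815, 0.816, 0.814, 0.817, 0.815]
baseline_auc = [0.802, 0.804, 0.801, 0.803, 0.802]

t_stat, p_value = stats.ttest_rel(mvqsn_auc, baseline_auc)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")  # improvement significant if p < 0.01
```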

3.4. Baseline Methods

We compare MVQSN against 14 representative knowledge tracing models spanning multiple architectural paradigms:
  • Recurrent-based: DKT [5], DKT-Q (question-based variant), DKT-Forgetting [22], and KT-Forgetting (question-based variant).
  • Factorization-based: KTM [23], serving as a non-deep baseline incorporating multiple features.
  • Attention-based: SAKT [8], AKT [9], and HawkesKT [24].
  • Graph-based: GIKT [12], GFLDKT [25], and KMKT [14].
  • Learning behavior modeling-based: CoKT [26], IEKT [27], LPKT-S [28], and KMKT [14].
These baselines collectively represent the evolutionary trajectory of knowledge tracing models—from sequential recurrent formulations to graph-enhanced and learning behavior modeling frameworks. Our proposed MVQSN distinguishes itself by introducing a unified multi-view relational learning paradigm, in which question–skill dependencies, skill complementarity, and question co-response patterns are jointly modeled and aligned through contrastive learning.

3.5. Overall Performance

The comprehensive performance comparison between our proposed MVQSN model and all baseline methods on the three datasets is presented in Table 2 and Table 3. We observe several key findings regarding overall accuracy, robustness, and the effectiveness of our architectural design. In particular, MVQSN consistently achieves the highest AUC and ACC scores across datasets, along with the lowest MAE and RMSE. This indicates that our model not only improves predictive accuracy but also better captures the uncertainty inherent in student learning behaviors. Such consistent superiority highlights the generalization capability of our approach under varying data distributions and student populations.
The improvements over classical sequence-based models such as DKT and DKVMN demonstrate the necessity of incorporating structural relations between questions and skills rather than relying solely on temporal sequences. DKT-based models are limited to sequential dependencies, overlooking the inherent hierarchical and relational structures within educational data. By contrast, MVQSN introduces a question–skill bipartite graph to explicitly encode these relations, allowing message propagation between conceptually relevant entities. This structural augmentation enhances representation richness and improves the model’s ability to infer unobserved student knowledge from sparse interactions.
Moreover, compared to recent graph-based methods such as GIKT, GFLDKT, and KMKT, our multi-view architecture provides significant gains, particularly in AUC and ACC metrics. These improvements stem from the introduction of two complementary relational views—namely, the skill complementarity view and the question co-response view. The former captures inter-skill synergies derived from co-occurrence patterns in student responses, while the latter models implicit correlations between questions answered by similar cohorts of students. By jointly leveraging these views, MVQSN effectively captures both conceptual-level and behavioral-level dependencies that single-view graph models fail to represent.
Finally, the multi-view contrastive learning module further enhances model robustness and interpretability by aligning representations across heterogeneous relational views while maintaining view-specific information. This mechanism encourages embeddings from different views to preserve their unique semantics yet remain consistent at the latent concept level. The attention-guided fusion layer subsequently integrates these embeddings into a unified representation that dynamically emphasizes the most informative view for each interaction. As a result, MVQSN achieves a more discriminative and stable estimation of student knowledge states, leading to substantial improvements in predictive performance across all datasets.

3.6. Ablation Study

To further investigate the contributions of each component in MVQSN, we conduct comprehensive ablation experiments on Assist2009 and EdNet-KT1, as summarized in Table 4 and Table 5. The full model includes four key components: multi-view graph construction (MV), cross-view contrastive learning (CL), attention-guided fusion (AF), and view-specific encoders (VE). By systematically removing individual modules or combinations thereof, we aim to quantify their influence on predictive performance and demonstrate the effectiveness of our architectural design.
First, the removal of the multi-view graph construction module (MV) results in the most pronounced performance degradation across all metrics. Specifically, AUC and ACC decrease, while MAE and RMSE increase relative to the full model. This indicates that capturing relational structures among questions and skills, as well as the complementary correlations between skills and co-response patterns among questions, is essential for accurately modeling students’ latent knowledge states. Without this structural information, the model effectively degenerates into a sequential predictor, limiting its ability to infer unobserved interactions.
Second, ablating the cross-view contrastive learning module (CL) leads to noticeable drops in AUC and ACC, accompanied by increases in MAE and RMSE. This demonstrates that aligning representations across heterogeneous views while preserving view-specific information significantly enhances the consistency and robustness of learned embeddings. The performance decline is more evident on EdNet-KT2, where the diversity of student responses and question characteristics is higher, suggesting that contrastive learning is particularly beneficial in datasets with complex relational structures.
Third, the attention-guided fusion (AF) and the view-specific encoders (VE) also contribute substantially to overall performance. Removing AF reduces the model’s ability to selectively integrate the most informative view for each student–question interaction, while removing the VE diminishes the capacity to capture unique characteristics inherent in each view. Both modifications lead to moderate declines in AUC and ACC, along with corresponding increases in MAE and RMSE, indicating that these components jointly ensure that the final knowledge state embedding is discriminative and stable.
Finally, the ablation results in Table 6 show that each view in MVQSN contributes distinct and irreplaceable information: the question–skill bipartite graph provides the fundamental mapping between questions and the skills they require, the skill complementarity view captures higher-order relationships among skills that frequently co-occur, and the question co-response view models latent dependencies between questions reflected in students’ response patterns. Removing any view breaks this complementary structure and weakens cross-view alignment, leading to consistent performance declines, while integrating all three views yields the most complete representation of students’ knowledge states.
Overall, the ablation results confirm that each component of MVQSN plays a critical and complementary role. The multi-view construction provides rich relational context, cross-view contrastive learning enforces representation consistency, and the attention-guided fusion with view-specific encoders ensures effective integration of heterogeneous information. Together, these mechanisms enable MVQSN to achieve superior predictive accuracy and precision, validating the necessity of our proposed design choices across all evaluation metrics.

3.7. Hyperparameter Sensitivity Analysis

To investigate the sensitivity of our model to the contrastive loss weight λ , we conduct a series of experiments on the Assist2009 dataset, where λ is varied within { 0.001 , 0.01 , 0.03 , 0.05 , 0.07 , 0.1 } . The corresponding results of AUC, ACC, MAE, and RMSE are presented in Figure 2. As observed, the model performance exhibits a clear trend of first increasing and then decreasing with the growth of λ , suggesting that an appropriate balance between the predictive objective and the contrastive objective is critical for stable optimization.
Specifically, when λ = 0.05 , the model achieves the optimal results with an AUC of 0.8156, ACC of 0.7639, MAE of 0.3115, and RMSE of 0.4085. This setting ensures that the contrastive learning module provides sufficient discriminative guidance for refining the multi-view knowledge representations, while avoiding excessive regularization that could distort the predictive embeddings. When λ is too small (e.g., <0.01), the contrastive objective becomes underweighted, leading to weak alignment between different knowledge views and insufficient feature separation. Conversely, when λ exceeds 0.07, the contrastive loss begins to dominate the optimization process, introducing instability and slightly degrading the predictive accuracy.
Overall, these results indicate that the proposed model is relatively robust to moderate variations in λ , and the optimal setting ( λ = 0.05 ) effectively balances representation discrimination and task-specific prediction. This balance enhances both convergence stability and generalization capability, highlighting the importance of contrastive regularization in learning consistent multi-view embeddings.
We further evaluate the impact of the number of graph convolution layers L in each view. As shown in Figure 3, increasing L from 1 to 3 steadily improves the prediction performance, with the best results achieved at L = 3 (AUC = 0.8156, ACC = 0.7639). Further increasing L to 4 or 5 leads to a slight drop in performance, suggesting potential over-smoothing effects. This analysis demonstrates that MVQSN is robust to the choice of graph layer depth within a reasonable range.

3.8. Case Study Analysis

To investigate the relationships captured by different views, we compute cross-view embedding similarities for both questions and skills. Specifically, we compare embeddings from the question–skill bipartite graph with those from the skill complementarity view for skills, and embeddings from the question–skill bipartite graph with the question co-response view for questions. As shown in Figure 4, embeddings corresponding to the same entity across different views generally exhibit high similarity, while embeddings of different entities remain well separated. These results indicate that the proposed multi-view representations effectively capture complementary relational information, providing a foundation for downstream student knowledge modeling.
To illustrate the contribution of the question co-response view, we highlight an interesting relationship discovered in the Assist2009 dataset. The question co-response view identifies a strong latent association between questions on skill Addition and Subtraction of Fractions and questions on skill Conversion of Fractions, Decimals, and Percents, even though these questions are not linked by the same skill tag. This suggests that students who struggle with fraction operations also tend to make similar errors in converting between fractions, decimals, and percents, revealing latent behavioral patterns that are not captured by the traditional question–skill bipartite graph. Incorporating such relationships helps MVQSN better model nuanced student knowledge dynamics.
Figure 5 illustrates the predicted probabilities of correctly answering 20 questions for three representative students. The heatmap shows that MVQSN captures student-specific knowledge states, assigning higher probabilities to questions that students are likely to answer correctly, while a few mispredictions reflect realistic uncertainty. Different students exhibit distinct response patterns, and the model adapts its predictions accordingly, demonstrating sensitivity to both individual learning behaviors and latent question dependencies. Overall, this case study highlights the interpretability and effectiveness of our approach in modeling nuanced student learning trajectories.

4. Conclusions

In this paper, we propose MVQSN, a novel model for tracing students’ evolving knowledge states in intelligent education systems. Unlike existing knowledge tracing approaches that rely on single-view representations, MVQSN constructs three complementary relational views: a question–skill bipartite graph capturing direct associations, a skill complementarity view reflecting synergistic skill relationships, and a question co-response view modeling latent question dependencies from aggregated student responses. Each view is encoded with graph neural networks, and a cross-view contrastive learning module aligns the embeddings while preserving view-specific information. An attention-guided fusion mechanism then generates unified knowledge state embeddings for precise student performance prediction.
Extensive experiments on three publicly available datasets demonstrate that MVQSN consistently outperforms state-of-the-art baselines across multiple evaluation metrics, confirming the effectiveness of multi-view representation learning and cross-view alignment. Ablation studies further highlight the contribution of each component, validating the importance of the proposed views and fusion strategy. A case study illustrates that MVQSN can capture subtle differences in students’ responses and reveal latent question associations, providing interpretable insights into learning patterns.
Future work will explore extending MVQSN to incorporate temporal dynamics more explicitly, integrating additional modalities such as textual explanations or video-based learning behaviors, and applying the multi-view framework to other educational tasks such as curriculum recommendation and adaptive learning path planning. We believe that the proposed multi-view modeling paradigm provides a promising direction for advancing personalized education and interpretable knowledge tracing.

Author Contributions

Conceptualization, J.L. and C.S.; methodology, J.L. and D.X.; software, S.M. and W.H.; validation, Y.C. and M.S.; formal analysis, J.L. and C.L.; resources, C.S.; data curation, M.S. and Y.C.; writing—original draft preparation, J.L.; writing—review and editing, C.S. and Y.D.; supervision, C.S.; project administration, C.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported in part by Humanity and Social Science Youth Foundation Project of Ministry of Education in China under Grant 25YJCZH122, in part by Key Areas Special Project for Regular Higher Education Institutions in Guangdong Province under Grant 2025ZDZX1023, in part by Guangdong Key Discipline Research Capacity Building Project under Grant (2024ZDJS055, 2024ZDJS056), in part by Ganzhou City Unveiling and Leading Project under Grant 2023ULGX0004, in part by Department of Science and Technology of Jiangxi Province under Grant 20232BCJ22050, and in part by Guangzhou Municipal Education Bureau under Grant 2024312000.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Li, J.; Mao, S.; Qin, Y.; Wang, F.; Jiang, Y. Hyperbolic Hypergraph Transformer with Knowledge State Disentanglement for Knowledge Tracing. IEEE Trans. Knowl. Data Eng. 2025, 37, 4677–4690. [Google Scholar] [CrossRef]
  2. Li, J.; Deng, Y.; Qin, Y.; Mao, S.; Jiang, Y. Dual-Channel Adaptive Scale Hypergraph Encoders with Cross-View Contrastive Learning for Knowledge Tracing. IEEE Trans. Neural Netw. Learn. Syst. 2025, 36, 6752–6766. [Google Scholar] [CrossRef]
  3. Lv, X.; Wang, G.; Chen, J.; Su, H.; Dong, Z.; Zhu, Y.; Liao, B.; Wu, F. Debiased Cognition Representation Learning for Knowledge Tracing. ACM Trans. Inf. Syst. 2025, 43, 127. [Google Scholar] [CrossRef]
  4. Shen, S.; Liu, Q.; Huang, Z.; Zheng, Y.; Yin, M.; Wang, M.; Chen, E. A Survey of Knowledge Tracing: Models, Variants, and Applications. IEEE Trans. Learn. Technol. 2024, 17, 1858–1879. [Google Scholar] [CrossRef]
  5. Piech, C.; Bassen, J.; Huang, J.; Ganguli, S.; Sahami, M.; Guibas, L.J.; Sohl-Dickstein, J. Deep Knowledge Tracing. In Proceedings of the Neural Information Processing Systems (NeurIPS), Montreal, QC, Canada, 7–12 December 2015; pp. 505–513. [Google Scholar]
  6. Zhu, Z.; Tian, Y.; Sun, J. Antenna Modeling Based on Image-CNN-LSTM. IEEE Antennas Wirel. Propagat. Lett. 2024, 23, 2738–2742. [Google Scholar] [CrossRef]
  7. Yu, G.; Xie, Z.; Zhou, G.; Zhao, Z.; Huang, J.X. Exploring long- and short-term knowledge state graph representations with adaptive fusion for knowledge tracing. Inf. Process. Manag. 2025, 62, 104074. [Google Scholar] [CrossRef]
  8. Pandey, S.; Karypis, G. A Self Attentive model for Knowledge Tracing. In Proceedings of the International Conference on Educational Data Mining (EDM), Montreal, QC, Canada, 2–5 July 2019; pp. 2330–2339. [Google Scholar]
  9. Ghosh, A.; Heffernan, N.; Lan, A.S. Context-aware attentive knowledge tracing. In Proceedings of the 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Virtual Event, CA, USA, 6–10 July 2020; pp. 2330–2339. [Google Scholar]
  10. Wang, X.; Chen, L.; Zhang, M. Deep Attentive Model for Knowledge Tracing. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington DC, USA, 7–14 February 2023; pp. 10192–10199. [Google Scholar]
  11. Mao, S.; Zhan, J.; Li, J.; Jiang, Y. Knowledge Structure-Aware Graph-Attention Networks for Knowledge Tracing. In Proceedings of the Knowledge Science, Engineering and Management (KSEM), Singapore, 6–8 August 2022; pp. 309–321. [Google Scholar]
  12. Yang, Y.; Shen, J.; Qu, Y.; Liu, Y.; Wang, K.; Zhu, Y.; Zhang, W.; Yu, Y. GIKT: A graph-based interaction model for knowledge tracing. In Proceedings of the Machine Learning and Knowledge Discovery in Databases, Ghent, Belgium, 14–18 September 2020; pp. 299–315. [Google Scholar]
  13. Qin, P.; Chen, W.; Zhang, M.; Li, D.; Feng, G. CC-GNN: A Clustering Contrastive Learning Network for Graph Semi-Supervised Learning. IEEE Access 2024, 12, 71956–71969. [Google Scholar] [CrossRef]
  14. Li, J.; Deng, Y.; Mao, S.; Qin, Y.; Jiang, Y. Knowledge-Associated Embedding for Memory-Aware Knowledge Tracing. IEEE Trans. Comput. Soc. Syst. 2024, 11, 4016–4028. [Google Scholar] [CrossRef]
  15. Liu, Z.; Guo, T.; Liang, Q.; Hou, M.; Zhan, B.; Tang, J.; Luo, W.; Weng, J. Deep Learning Based Knowledge Tracing: A Review, a Tool and Empirical Studies. IEEE Trans. Knowl. Data Eng. 2025, 37, 4512–4536. [Google Scholar] [CrossRef]
  16. Liu, Z.; Chen, J.; Luo, W. Recent Advances on Deep Learning based Knowledge Tracing. In Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining, Singapore, 27 February–3 March 2023; pp. 1295–1296. [Google Scholar]
  17. Deng, Y.; Bai, W.; Li, J.; Mao, S.; Jiang, Y. Improving semantic similarity computation via subgraph feature fusion based on semantic awareness. Eng. Appl. Artif. Intell. 2024, 136, 108947. [Google Scholar] [CrossRef]
  18. Feng, Y.; Liu, P.; Du, Y.; Jiang, Z. Cross working condition bearing fault diagnosis based on the combination of multimodal network and entropy conditional domain adversarial network. J. Vib. Control 2024, 30, 5375–5386. [Google Scholar] [CrossRef]
  19. Zhang, M.; Ji, A.; Zhou, C.; Ding, Y.; Wang, L. Real-time prediction of TBM penetration rates using a transformer-based ensemble deep learning model. Autom. Constr. 2024, 168, 105793. [Google Scholar] [CrossRef]
  20. Xiang, D.; Zhou, Z.; Yang, W.; Wang, H.; Gao, P.; Xiao, M.; Zhang, J.; Zhu, X. A fusion framework with multi-scale convolution and triple-branch cascaded transformer for underwater image enhancement. Opt. Lasers Eng. 2025, 184, 108640. [Google Scholar] [CrossRef]
  21. Shen, S.; Liu, Q.; Chen, E.; Huang, Z.; Huang, W.; Yin, Y.; Su, Y.; Wang, S. Learning Process-consistent Knowledge Tracing. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, Virtual Event, Singapore, 14–18 August 2021; pp. 1452–1460. [Google Scholar]
  22. Nagatani, K.; Zhang, Q.; Sato, M.; Chen, Y.Y.; Chen, F.; Ohkuma, T. Augmenting knowledge tracing by considering forgetting behavior. In Proceedings of the WWW’19: The Web Conference, San Francisco, CA, USA, 13–17 May 2019; pp. 3101–3107. [Google Scholar]
  23. Vie, J.; Kashima, H. Knowledge Tracing Machines: Factorization Machines for Knowledge Tracing. Proc. AAAI Conf. Artif. Intell. 2019, 33, 750–757. [Google Scholar] [CrossRef]
  24. Wang, C.; Ma, W.; Zhang, M.; Lv, C.; Wan, F.; Lin, H.; Tang, T.; Liu, Y.; Ma, S. Temporal cross-effects in knowledge tracing. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining, Virtual Event, Israel, 8–12 March 2021; pp. 517–525. [Google Scholar]
  25. Zhao, W.; Xia, J.; Jiang, X.; He, T. A novel framework for deep knowledge tracing via gating-controlled forgetting and learning mechanisms. Inf. Process. Manag. 2023, 60, 103114. [Google Scholar] [CrossRef]
  26. Long, T.; Qin, J.; Shen, J.; Zhang, W.; Xia, W.; Tang, R.; He, X.; Yu, Y. Improving Knowledge Tracing with Collaborative Information. In Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining, Virtual Event, AZ, USA, 21–25 February 2022; pp. 599–607. [Google Scholar]
  27. Long, T.; Liu, Y.; Shen, J.; Zhang, W.; Yu, Y. Tracing Knowledge State with Individual Cognition and Acquisition Estimation. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event, Canada, 11–15 July 2021; pp. 173–182. [Google Scholar]
  28. Shen, S.; Chen, E.; Liu, Q.; Huang, Z.; Huang, W.; Yin, Y.; Su, Y.; Wang, S. Monitoring Student Progress for Learning Process-Consistent Knowledge Tracing. IEEE Trans. Knowl. Data Eng. 2023, 35, 8213–8227. [Google Scholar] [CrossRef]
Figure 1. The MVQSN framework overview.
Figure 2. The impact of the cross-view contrastive learning constraint λ on MVQSN.
Figure 3. The impact of the number of graph neural network layers L on MVQSN.
Figure 4. Cross-view embedding similarity visualization.
Figure 5. Case study: visualization of predictions for three students on 20 questions. ∘ denotes samples with a ground-truth label of 1, while • denotes samples with a ground-truth label of 0.
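Figure 2 varies the weight λ that balances the cross-view contrastive alignment term against the prediction loss. For orientation only, the sketch below shows one common InfoNCE-style formulation of such an alignment term and how a weight λ typically enters the overall objective; the temperature value, the pairwise treatment of views, and the function names are illustrative assumptions, not the exact objective used in MVQSN.

```python
import torch
import torch.nn.functional as F

def info_nce(z_a, z_b, temperature=0.2):
    """InfoNCE-style alignment between two views of the same students.

    z_a, z_b: [N, d] view-specific student embeddings; row i of both
    tensors is assumed to describe the same student (positive pair).
    """
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature                  # [N, N] similarities
    targets = torch.arange(z_a.size(0), device=z_a.device)
    # Diagonal entries are positives; off-diagonal entries act as negatives.
    return F.cross_entropy(logits, targets)

def total_loss(pred_loss, view_embeddings, lam=0.1):
    """Hypothetical overall objective: prediction loss + lambda * alignment."""
    align = sum(
        info_nce(view_embeddings[i], view_embeddings[j])
        for i in range(len(view_embeddings))
        for j in range(i + 1, len(view_embeddings))
    )
    return pred_loss + lam * align
```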
Table 1. Statistics of the three benchmark datasets used for evaluation.

Statistics                   ASSIST2009    EdNet-KT1    EdNet-KT2
# of students                3841          5000         5000
# of questions               15,911        11,718       11,427
# of skills                  123           189          188
# of records                 258,896       498,374      394,986
Avg. questions per skill     156.1         136.8        133.7
Avg. skills per question     1.2           2.2          2.2
Table 2. Performance comparison on the Assist2009 and EdNet1 datasets in terms of AUC, ACC, MAE, and RMSE. Best performances are in bold and second-best performances are underlined. * indicates statistical significance with p-value < 0.01 based on the t-test.

Model               Assist2009                          EdNet1
                    AUC      ACC      MAE      RMSE     AUC      ACC      MAE      RMSE
DKT                 0.7484   0.7170   0.3685   0.4366   0.6877   0.6390   0.4440   0.4700
DKT-Q               0.7379   0.7107   0.3782   0.4403   0.6913   0.6426   0.4314   0.4691
KTM                 0.7321   0.6878   0.3904   0.4463   0.7237   0.6716   0.4052   0.4608
DKTForgetting       0.7468   0.7075   0.3746   0.4393   0.6815   0.6349   0.4272   0.4747
DKTForgetting-Q     0.7475   0.7081   0.3708   0.4382   0.6929   0.6455   0.4305   0.4691
SAKT                0.6983   0.6852   0.4078   0.4570   0.6607   0.6210   0.4489   0.4781
AKT                 0.7210   0.6897   0.3832   0.4568   0.7190   0.6678   0.3971   0.4670
GIKT                0.7948   0.7424   0.3237   0.4254   0.7420   0.6818   0.3980   0.4563
HawkesKT            0.7346   0.6942   0.3826   0.4482   0.7166   0.6674   0.4174   0.4642
IEKT                0.7727   0.7260   0.3494   0.4306   0.7319   0.6754   0.4094   0.4566
CoKT                0.7703   0.7223   0.3710   0.4296   0.7244   0.6677   0.4229   0.4584
GFLDKT              0.7378   0.6801   0.3809   0.4497   0.7342   0.6743   0.4135   0.4564
LPKT-S              0.7832   0.7319   0.3426   0.4339   0.7518   0.6877   0.3695   0.4542
KMKT                0.8129   0.7559   0.3139   0.4117   0.7542   0.7007   0.3542   0.4495
MVQSN               0.8156*  0.7639*  0.3115   0.4086*  0.7611*  0.7052*  0.3496*  0.4437*
Table 3. Performance comparison on the EdNet2 dataset in terms of AUC, ACC, MAE, and RMSE. Best performances are in bold and second-best performances are underlined. * indicates statistical significance with p-value < 0.01.

Model               AUC      ACC      MAE      RMSE
DKT                 0.6140   0.5824   0.4746   0.4895
DKT-Q               0.6050   0.5703   0.4709   0.4947
KTM                 0.6509   0.6082   0.4629   0.4834
DKTForgetting       0.6183   0.5864   0.4678   0.4917
DKTForgetting-Q     0.6265   0.5910   0.4652   0.4888
SAKT                0.5677   0.5129   0.4941   0.5000
AKT                 0.6327   0.5921   0.4613   0.4892
GIKT                0.6512   0.6153   0.4556   0.4868
HawkesKT            0.6402   0.6007   0.4685   0.4855
IEKT                0.6530   0.6143   0.4539   0.4902
CoKT                0.6387   0.6042   0.4590   0.4857
GFLDKT              0.6556   0.6122   0.4670   0.4815
LPKT-S              0.6530   0.6135   0.4583   0.4856
KMKT                0.7035   0.6501   0.4426   0.4773
MVQSN               0.7116*  0.6602*  0.4340*  0.4695*
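The four metrics reported in Tables 2 and 3 are standard and can be reproduced from per-interaction predicted correctness probabilities and binary labels, as in the minimal sketch below. scikit-learn is assumed, the 0.5 decision threshold for ACC is a common convention rather than a detail restated from the paper, and the t-test behind the * marks is indicated only as a comment.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, mean_absolute_error,
                             mean_squared_error, roc_auc_score)

def kt_metrics(y_true, y_prob, threshold=0.5):
    """AUC, ACC, MAE, and RMSE for knowledge-tracing predictions.

    y_true: binary correctness labels; y_prob: predicted probabilities.
    """
    y_true, y_prob = np.asarray(y_true), np.asarray(y_prob)
    return {
        "AUC": roc_auc_score(y_true, y_prob),
        "ACC": accuracy_score(y_true, (y_prob >= threshold).astype(int)),
        "MAE": mean_absolute_error(y_true, y_prob),
        "RMSE": np.sqrt(mean_squared_error(y_true, y_prob)),
    }

# Example: print(kt_metrics([1, 0, 1, 1], [0.9, 0.3, 0.6, 0.4]))
# The * marks in Tables 2 and 3 come from a t-test; a paired t-test over
# repeated runs (scipy.stats.ttest_rel) is one common way to realize it.
```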
Table 4. Ablation study results on Assist2009 and EdNet1. The key components are multi-view graph construction (MV), cross-view contrastive learning (CL), attention-guided fusion (AF), and view-specific encoders (VE). Best performances are in bold.

Model               Assist2009                          EdNet1
                    AUC      ACC      MAE      RMSE     AUC      ACC      MAE      RMSE
w/o MV              0.7901   0.7354   0.3367   0.4523   0.7452   0.6895   0.3621   0.4826
w/o CL              0.7983   0.7421   0.3328   0.4478   0.7510   0.6952   0.3589   0.4762
w/o AF              0.8025   0.7468   0.3302   0.4451   0.7543   0.6991   0.3567   0.4728
w/o VE              0.8010   0.7450   0.3315   0.4463   0.7531   0.6982   0.3572   0.4740
w/o                 0.7852   0.7308   0.3389   0.4552   0.7420   0.6871   0.3635   0.4853
w/o                 0.7878   0.7325   0.3374   0.4537   0.7438   0.6886   0.3624   0.4837
w/o                 0.7896   0.7343   0.3362   0.4519   0.7461   0.6904   0.3611   0.4818
w/o                 0.7905   0.7358   0.3359   0.4512   0.7468   0.6910   0.3608   0.4812
MVQSN (Full)        0.8156   0.7639   0.3115   0.4085   0.7611   0.7052   0.3496   0.4437
Table 5. Ablation study results on EdNet2. The key components are multi-view graph construction (MV), cross-view contrastive learning (CL), attention-guided fusion (AF), and view-specific encoders (VE). Best performances are in bold.

Model               AUC      ACC      MAE      RMSE
w/o MV              0.6931   0.6378   0.4512   0.4945
w/o CL              0.6994   0.6429   0.4480   0.4921
w/o AF              0.7008   0.6442   0.4472   0.4913
w/o VE              0.6997   0.6431   0.4478   0.4919
w/o                 0.6885   0.6329   0.4551   0.4976
w/o                 0.6908   0.6347   0.4537   0.4963
w/o                 0.6920   0.6361   0.4524   0.4952
w/o                 0.6925   0.6367   0.4520   0.4949
MVQSN (Full)        0.7116   0.6602   0.4340   0.4695
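The attention-guided fusion (AF) component ablated in Tables 4 and 5 combines the view-specific student embeddings into a single knowledge-state vector. The sketch below illustrates one common attention-weighted fusion scheme; the scoring network, hidden size, and class name are illustrative assumptions rather than the exact module used in MVQSN.

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Attention-weighted fusion of V view-specific embeddings per student."""

    def __init__(self, dim, hidden=64):
        super().__init__()
        # Scores each view embedding with a small two-layer network.
        self.score = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(),
                                   nn.Linear(hidden, 1))

    def forward(self, views):
        # views: [N, V, dim] -> attention weights over the view axis: [N, V, 1]
        weights = torch.softmax(self.score(views), dim=1)
        # Weighted sum over views gives the fused knowledge state: [N, dim]
        return (weights * views).sum(dim=1)

# Example: fuse three 128-d view embeddings for a batch of 32 students.
fused = AttentionFusion(dim=128)(torch.randn(32, 3, 128))
```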
Table 6. Contributions of each view on Assist2009. The key views are the question–skill bipartite graph (QS), skill complementarity view (SS), and question co-response view (QQ). Best performances are in bold.

QS   SS   QQ        AUC      ACC      MAE      RMSE
                    0.7983   0.7421   0.3328   0.4478
                    0.7755   0.7211   0.3424   0.4635
                    0.7893   0.7319   0.3376   0.4559
                    0.8022   0.7464   0.3302   0.4439
                    0.8075   0.7514   0.3247   0.4325
                    0.7974   0.7426   0.3319   0.4465
                    0.8156   0.7639   0.3115   0.4085
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
