Article

Integrating Ensemble Learning with Item Response Theory to Improve the Interpretability of Student Learning Outcome Tracing

Department of Electrical and Computer Engineering, Prairie View A&M University, Prairie View, TX 77446, USA
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(23), 12594; https://doi.org/10.3390/app152312594
Submission received: 16 October 2025 / Revised: 10 November 2025 / Accepted: 19 November 2025 / Published: 27 November 2025

Abstract

Student learning outcome (SLO) tracing aims to monitor students’ learning progress by predicting their likelihood of passing or failing courses using Deep Knowledge Tracing (DKT). However, conventional DKT models often lack interpretability, limiting their adoption in educational settings that require transparent decision-making. To address this challenge, this quantitative study proposes an interpretable ensemble framework that integrates Item Response Theory (IRT) with DKT. Specifically, multiple IRT-based DKT models are developed to capture student ability and item characteristics, and these models are combined using a bagging strategy to enhance predictive performance and robustness. The framework is evaluated on an SLO tracing dataset from Prairie View A&M University (PVAMU), a historically Black college and university (HBCU). Result analysis includes comparisons of evaluation metrics such as Area Under the Curve (AUC), accuracy (ACC), and precision across individual and ensemble models, as well as visualizations of student ability, item difficulty, and predicted probabilities to assess interpretability. Experimental results demonstrate that the ensemble approach consistently outperforms single models while providing clear, interpretable insights into student learning dynamics. These findings suggest that integrating ensemble methods with IRT can simultaneously improve prediction accuracy and transparency in SLO tracing.

1. Introduction

In the era of personalized education and intelligent tutoring systems, understanding how students learn and predicting their academic performance have become critical tasks in educational data mining. Deep Knowledge Tracing (DKT) has emerged as a powerful framework for modeling a student’s knowledge progression over time, leveraging deep neural architectures to track sequences of learner interactions. These capabilities make DKT well-suited for adaptive learning environments, enabling real-time, personalized feedback aligned with each student’s unique trajectory [1].
In recent years, student learning outcome (SLO) tracing has attracted significant attention. This approach aims to monitor students’ learning progress by predicting their likelihood of passing or failing courses using DKT [2,3]. Kuo et al. applied DKT techniques for SLO tracing at a historically Black college and university (HBCU), focusing on Science, Technology, Engineering, and Mathematics (STEM) education at PVAMU [3]. They further explored the use of generative artificial intelligence models to produce synthetic data, augmenting real datasets to enhance SLO tracing at PVAMU [2]. This approach offers the potential to identify at-risk students and enable proactive interventions aimed at improving retention and graduation rates in STEM education.
However, the lack of transparency in traditional DKT models poses a significant limitation, especially in educational contexts where trust and explainability are vital. As DKT models operate as black-box systems, it becomes difficult for educators to interpret model predictions or diagnose the underlying causes of student performance. To address this challenge, a few studies explored various explainable techniques in knowledge tracing. For instance, Huang et al. integrated cognitive learning theories and Multidimensional Item Response Theory (MIRT) to enhance interpretability of knowledge tracing [4]. In addition, Lu et al. utilized layer-wise relevance propagation (LRP) to visualize skill-response associations, quantifying the relevance of specific input sequences to model outputs [5]. Likewise, Li et al. proposed the Genetic Causal Explainer (GCE) framework to identify causal relationships between learning events and predicted outcomes using evolutionary algorithms [6].
This quantitative study aims to enhance the interpretability of student learning outcome (SLO) tracing models while improving their predictive performance. To achieve this, it integrates ensemble learning with Item Response Theory (IRT), a statistical framework that models the relationship between a learner’s latent ability (e.g., knowledge, skills, or traits) and the probability of correctly answering an item (e.g., a test question) [7]. Specifically, the DeepIRT model combines IRT with deep neural networks to capture temporal dependencies while retaining interpretable parameters such as item difficulty and student ability [8]. In this study, multiple IRT-based interpretable DKT models were first developed for SLO tracing and then combined using a bagging ensemble strategy to enhance generalizability and provide a more comprehensive view of student knowledge states. By aggregating predictions, the ensemble approach not only improves accuracy but also supports interpretability through visualizations of student ability, item difficulty, and predicted probabilities [9]. The proposed method was evaluated on an SLO tracing dataset from PVAMU, a historically Black college and university (HBCU). Experimental results show that the approach consistently outperforms individual models across key metrics, including Area Under the Curve (AUC), accuracy, and precision. These outcomes demonstrate that integrating ensemble methods with IRT can simultaneously advance predictive performance and transparency, guiding future research on interpretable SLO tracing and supporting educators in understanding student learning dynamics.
The contributions of this study are as follows:
(1) Compared with DeepIRT and the SLO tracing implementation at PVAMU [2,3], this study integrates cognitive theories with ensemble deep learning techniques to develop a novel and interpretable SLO tracing framework. Unlike DeepIRT, which focuses on individual IRT-based DKT models, and the SLO implementation at PVAMU, which also employs a single DKT model, this paper constructs multiple IRT-based interpretable DKT models for SLO tracing. The outputs of these models are then aggregated using a bagging ensemble approach to enhance generalizability and provide a more comprehensive representation of students’ knowledge states. Furthermore, visualization tools are utilized to depict the relationships among student ability, item difficulty, and predicted probabilities, thereby improving the interpretability of the SLO tracing process. These visualizations function as transparent and actionable decision-support tools for educators and administrators, allowing for a deeper understanding of student learning patterns.
(2) Comprehensive validation of the proposed method on a PVAMU dataset demonstrates its effectiveness in accurately predicting student learning outcomes. The dataset spans multiple colleges, including the College of Engineering and the College of Arts & Science, and covers several departments such as Electrical & Computer Engineering, Civil Engineering, and Computer Science. Experimental results show that the proposed approach consistently outperforms individual models across key metrics, including AUC, accuracy, and precision.

2. Related Work

Over the last decade, deep learning models have significantly advanced the field of educational data mining, particularly through applications like Deep Knowledge Tracing (DKT) and efforts to extend it to Student Learning Outcome (SLO) prediction. While overlapping in purpose, these two areas serve distinct goals: DKT focuses on fine-grained, temporal prediction of student mastery at the skill level, whereas SLO models aim to assess broader academic goals at the course or institutional level. Both domains increasingly emphasize not only predictive accuracy but also transparency and interpretability, which are essential for informed decision-making in educational settings. Despite these advances, there remains a lack of research applying interpretable models to SLO tracing specifically in Historically Black Colleges and Universities (HBCUs), representing a critical gap in the literature.
DKT was originally introduced using Long Short-Term Memory (LSTM) networks to model the temporal progression of student knowledge [10]. Subsequent extensions, such as the Dynamic Key-Value Memory Network (DKVMN) [11], incorporated multi-relational information—including exercise-concept relationships and memory-based representations of student forgetting—to improve both predictive performance and interpretability. Transformer-based architectures further expanded DKT’s capabilities. For example, Lu et al. [12] integrated curriculum structure and student interaction data using self-attention mechanisms, enabling strong predictive performance alongside interpretable attention visualizations. Knowledge Component Deep Knowledge Tracing (KCDKT) aligns latent student embeddings with known skill hierarchies, facilitating concept-level interpretability, while stage-aware models like Learning Intention-Aware Knowledge Tracing for Learning Stage (ISKT) [13] capture transitions in learning intentions across cognitive stages.
Efforts to adapt DKT for SLO tracing at HBCUs have been limited but promising. Kuo et al. [3] demonstrated the feasibility of applying DKT variants, including Deep Knowledge Tracing plus (DKT+), DKVMN, Self-Attentive Knowledge Tracing (SAKT), and the Knowledge Query Network (KQN), to trace SLOs in STEM courses at Prairie View A&M University (PVAMU), revealing DKVMN’s superior performance in identifying at-risk students from sparse assessment data. Extending this work, Kuo et al. [2] applied tabular generative AI to synthesize student interaction data, enhancing model robustness and improving predictive metrics such as AUC. Despite these advances, few studies have systematically explored the integration of interpretable modeling frameworks such as Multidimensional Item Response Theory (MIRT), layer-wise relevance propagation (LRP), or the Genetic Causal Explainer (GCE) in SLO tracing, leaving a gap in methods that can provide actionable insights for educators.
Recent research in explainable AI has begun to address this gap. Chen et al. [14] introduced question-centric interpretability by aligning predictions with item-level characteristics via IRT-inspired layers, while Response Influence-based Counterfactual Knowledge Tracing (RCKT) [15] enables educators to explore “what-if” scenarios based on prior student responses. Attention mechanisms in Transformer-based DKT variants further support transparency by highlighting influential past interactions. Collectively, these approaches indicate a shift toward inherently interpretable models rather than post hoc explanations, but their adoption in SLO tracing at HBCUs remains limited.
To address these gaps, this study integrates Ensemble Learning with IRT to enhance interpretability and predictive performance in SLO tracing. By combining 1-Parameter Logistic (1PL), 2-Parameter Logistic (2PL), and DeepIRT models, the ensemble framework leverages complementary strengths of multiple models. Evaluation on academic data from PVAMU demonstrates that the ensemble consistently outperforms individual models across metrics including AUC, accuracy, precision, and F1-score, while providing interpretable insights into student ability, item difficulty, and predicted outcomes. These findings contribute to both methodological knowledge—highlighting how DKT, IRT, and ensemble methods can be combined for SLO tracing—and practical understanding of student learning in HBCU contexts.

3. Task Definition

This study focuses on tracking student learning outcomes (SLO) by predicting course success or failure using DKT techniques. The objective of this task is to predict passing or failing a sequence of upcoming courses, given the student’s prior course history:
$$f(S) = P(C_{t+1}, \ldots, C_n \mid S)$$
Here, the future course sequence is denoted by $(C_{t+1}, \ldots, C_n)$, while $S = (C_1, C_2, \ldots, C_t)$ represents the historical course sequence. Potentially, this task can be leveraged to improve retention and graduation rates at HBCUs by monitoring student learning trajectories and identifying early indicators of academic risk, enabling timely and targeted interventions.

4. Methods

4.1. Item Response Theory (IRT)

Item Response Theory (IRT) is a foundational framework in psychometrics and educational assessment that models the interaction between a student’s latent ability and the characteristics of test items. Unlike traditional scoring methods that treat all test items equally, IRT enables nuanced and individualized measurement by capturing how different items function across varying levels of student proficiency. This feature makes IRT particularly valuable in adaptive testing, intelligent tutoring systems, and interpretable machine learning applications in education [16,17]. At its core, IRT models the probability that a student with a given latent trait, typically referred to as “ability,” will correctly respond to a specific test item. This formulation incorporates item-level properties such as difficulty, discrimination, and guessing factors to evaluate performance [18].
This quantitative study focused on two types of IRT: (1) The One Parameter Logistic Model (1PL) is a simplified form of IRT used to model the probability that a learner will correctly answer an item (e.g., a test question). This model, also known as the Rasch model, is primarily focused on estimating how difficult each item is and how able each student is, assuming that all items are equally informative (same discrimination); (2) Two-Parameter Logistic Model (2PL) extends the 1PL model by introducing an additional parameter: the discrimination of each item. The 2PL model is more flexible than the 1PL because it allows items to vary in how effectively they distinguish between high- and low-ability learners.

4.2. Ensemble Learning

Ensemble learning is a powerful machine learning paradigm that combines multiple models to achieve better generalization and robustness than a single learner [19,20]. This approach trains multiple base learners and integrates their predictions to enhance performance, mitigate individual model limitations, and reduce errors. Ensemble methods are commonly categorized into boosting, bagging, and stacking [21]. Specifically, bagging involves training multiple learners in parallel on different subsets of the data and aggregating their outputs through majority voting or averaging. Bagging is particularly effective with high-variance, low-bias models such as decision trees, as it smooths out prediction fluctuations caused by slight changes in the training data. The most well-known implementation of bagging is the Random Forest algorithm, where decision trees are further diversified through random feature selection at each split.
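As a minimal illustration of the paradigm (not the ensemble used in this paper, which combines heterogeneous IRT-based DKT models rather than decision trees), the following scikit-learn sketch contrasts plain bagging with a Random Forest on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data standing in for student records.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Bagging: decision trees (the scikit-learn default base learner) trained on
# bootstrap resamples of the data and aggregated by voting.
bag = BaggingClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Random Forest: bagging plus random feature selection at each split.
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

print("bagging:", bag.score(X_te, y_te), "random forest:", rf.score(X_te, y_te))
```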

4.3. Proposed Method

This study combines IRT and bagging to enhance performance and interpretability of DKT. The proposed approach consists of two stages: IRT-based DKT and Bagging DKT.

4.3.1. IRT-Based DKT

Specifically, DeepIRT integrates IRT with a Dynamic Key-Value Memory Network (DKVMN) to make deep learning-based knowledge tracing explainable [22]. Within the DeepIRT framework, three IRT-based DKT models are implemented as follows:
  • 1PL-based DKT implements an interpretable model based on 1PL. Each item j is characterized by a single parameter, its difficulty βj. The model assumes that all items share the same discrimination power, meaning they are equally effective at distinguishing between learners of different ability levels. The probability that student i, with ability θi, answers item j correctly is given by:
$$P_{ij}(r_{ij} = 1 \mid \theta_i, \beta_j) = \frac{1}{1 + \exp(-(\theta_i - \beta_j))}$$
where:
- $\theta_i$ is the student ability;
- $\beta_j$ is the item difficulty;
- $r_{ij} \in \{0, 1\}$ indicates the response (1 = correct, 0 = incorrect);
- $P_{ij}$ is the predicted probability that student i correctly answers item j.
  • 2PL-based DKT employs 2PL IRT to build an interpretable DKT model. In this model, each item j is characterized by difficulty parameter βj and discrimination parameter αj, which reflects how well the item differentiates between students of different ability levels:
$$P_{ij}(r_{ij} = 1 \mid \theta_i, \beta_j, \alpha_j) = \frac{1}{1 + \exp[-\alpha_j(\theta_i - \beta_j)]}$$
where $\alpha_j$ is the item discrimination.
The 2PL model is more flexible than the 1PL because it allows items to vary in how effectively they distinguish between high- and low-ability learners.
  • DeepIRT [22] introduces a variant of IRT with a weight λ applied to the ability parameter:
$$P_{ij}(r_{ij} = 1 \mid \theta_i, \lambda, \beta_j) = \frac{1}{1 + \exp[-(\lambda\theta_i - \beta_j)]}$$
Here, $\lambda$ is the weight applied to the ability.
All three models retain the predictive performance of the deep learning-based knowledge tracing approach while enabling estimation of both the knowledge difficulty level and the student ability level over time through variants of IRT.
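For concreteness, the three response functions above can be expressed directly in code. The following is a minimal NumPy sketch in which θ, β, α, and λ are plain scalars; in the actual framework, the student ability and item difficulty are produced per time step by the DKVMN-based networks.

```python
import numpy as np

def p_1pl(theta, beta):
    """1PL (Rasch): only item difficulty beta; shared discrimination."""
    return 1.0 / (1.0 + np.exp(-(theta - beta)))

def p_2pl(theta, beta, alpha):
    """2PL: per-item discrimination alpha scales the ability-difficulty gap."""
    return 1.0 / (1.0 + np.exp(-alpha * (theta - beta)))

def p_deepirt(theta, beta, lam):
    """DeepIRT-style link: weight lam applied to the ability term."""
    return 1.0 / (1.0 + np.exp(-(lam * theta - beta)))

# Example: a student slightly above average facing a moderately hard item.
theta, beta = 0.5, 0.2
print(p_1pl(theta, beta))                # ~0.574
print(p_2pl(theta, beta, alpha=2.0))     # sharper curve: ~0.646
print(p_deepirt(theta, beta, lam=1.5))   # ~0.634
```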

4.3.2. Bagging DKT

In this stage, the three IRT-based DKT models are aggregated using a heterogeneous bagging approach. The final prediction is determined by majority voting:
$$\hat{y} = \arg\max_{k} \sum_{m=1}^{3} \mathbf{1}\left(f_{\theta_m} = k\right)$$
where $k$ ranges over the possible class labels (passing or failing the course), $f_{\theta_m}$ denotes the prediction of the m-th IRT-based DKT model, and $\mathbf{1}(\cdot)$ is the indicator function.
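A minimal sketch of this vote, under the assumption that each model’s pass probability is thresholded at 0.5 before voting (the threshold is an illustrative choice, not specified above):

```python
import numpy as np

def bagging_vote(prob_1pl, prob_2pl, prob_deepirt, threshold=0.5):
    """Majority vote over the three IRT-based DKT models (k in {0, 1})."""
    preds = np.stack([prob_1pl, prob_2pl, prob_deepirt]) >= threshold  # (3, n)
    votes_pass = preds.sum(axis=0)          # models predicting "pass" per student
    return (votes_pass >= 2).astype(int)    # class winning at least 2 of 3 votes

# Hypothetical per-student pass probabilities from the three models:
p1 = np.array([0.62, 0.41, 0.55])
p2 = np.array([0.58, 0.48, 0.44])
p3 = np.array([0.49, 0.52, 0.61])
print(bagging_vote(p1, p2, p3))  # -> [1 0 1]
```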
The proposed method is implemented on top of the Dynamic Key-Value Memory Network for Knowledge Tracing (DKVMN), which addresses the knowledge tracing task by applying nonlinear transformations to learn latent representations and directly predict a student’s mastery level for each concept. It employs a static memory matrix, referred to as the key, to store the knowledge concepts, and a dynamic memory matrix, referred to as the value, to store and update the corresponding mastery levels over time [23].
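To illustrate the memory addressing involved, the following NumPy sketch implements only the read operation of a DKVMN-style key-value memory with toy, randomly initialized matrices; the write/update step and all learned embeddings are omitted.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
N, d_k, d_v = 5, 8, 8                 # number of concepts, key/value dims

M_key = rng.normal(size=(N, d_k))     # static key memory: the concepts
M_value = rng.normal(size=(N, d_v))   # dynamic value memory: mastery levels

def read(k_t):
    """Attend over concepts with an exercise embedding and read mastery."""
    w = softmax(M_key @ k_t)          # correlation weights over the N concepts
    return w @ M_value, w             # read content r_t and the weights

k_t = rng.normal(size=d_k)            # embedding of the attempted item
r_t, w = read(k_t)
print(w.round(3))                     # which concepts this item activates
```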

5. Experiment

5.1. Dataset

To prepare the data for DKT modeling, several preprocessing steps were performed:
1. Data records lacking grades, containing incomplete courses, or corresponding to non-gradable courses were removed from the dataset.
2. Since the dataset included categorical features that are incompatible with most machine learning algorithms, these course-related categorical variables were transformed into numerical representations. Specifically, label encoding was applied to convert each unique nominal category into an integer value. The course subject and level features were thus encoded as integers to represent a student’s knowledge skill.
3. Courses sharing the same subject and level were treated as reflecting the same knowledge skill. For instance, courses such as MATH 3201 and MATH 3800, both belonging to the 3000-level mathematics category, were considered part of the same skill group.
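As a minimal illustration of steps 2 and 3, the following pandas sketch maps courses to a subject-plus-level skill group and label-encodes it; the column names are hypothetical, not the actual PVAMU schema.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical course records.
df = pd.DataFrame({
    "course": ["MATH 3201", "MATH 3800", "ENGL 1301"],
    "grade":  ["B", "C", "A"],
})
df["subject"] = df["course"].str.split().str[0]             # e.g., MATH
df["level"] = df["course"].str.extract(r"(\d)")[0] + "000"  # e.g., 3000
df["skill"] = df["subject"] + "-" + df["level"]             # shared KC group

# Both MATH courses fall into the same MATH-3000 skill and share an ID.
df["skill_id"] = LabelEncoder().fit_transform(df["skill"])
print(df[["course", "skill", "skill_id"]])
```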
The dataset was partitioned into training and testing sets for model evaluation. Table 1 presents the descriptive statistics of the training datasets, which consist of the College of Engineering (COE), the College of Arts & Science (COAS), and the University as a whole, while Table 2 summarizes the distribution and characteristics of the testing sets used for final performance validation: Civil & Environmental Engineering (CEE), Chemical Engineering (CHE), Computer Science (CSC), Electrical & Computer Engineering (ECE), and Mechanical Engineering (MCE). The number of knowledge components (KCs) in each set is also reported. This curated dataset underpins the development and evaluation of interpretable deep learning models, particularly within the context of knowledge tracing, item response theory (IRT), and ensemble learning frameworks. Its depth and diversity make it well suited for investigating real-world academic performance, student risk detection, and course difficulty estimation in higher education settings.

5.2. Experiment Setup

Table 3 presents the experiment setup for this study. We configured the training pipeline with a batch size of 32, which offers a balance between computational efficiency and gradient stability. The training duration spanned 50 epochs, allowing the model sufficient exposure to the dataset while mitigating overfitting through gradual convergence. The learning rate was set to 0.003, which controls the magnitude of weight updates during backpropagation. This value is empirically known to provide an effective trade-off between learning speed and convergence reliability in similar deep neural architectures. The sequence length was fixed at 200, meaning each student learning trajectory was truncated or padded to 200 interactions, ensuring uniform input dimensions for the models. To optimize the model, the Adam optimizer was employed. The proposed method is implemented in TensorFlow and trained on NVIDIA V100 Graphics Processing Units (GPUs), ensuring efficient handling of large-scale computations, model complexity, and data volume.
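A hedged Keras sketch of this configuration follows: the hyperparameters match Table 3, but the network is a small stand-in LSTM on toy sequences rather than the paper’s DKVMN-based models.

```python
import numpy as np
import tensorflow as tf

# Hyperparameters from Table 3.
BATCH_SIZE, EPOCHS, LEARNING_RATE, SEQ_LEN = 32, 50, 0.003, 200

# Toy interaction sequences of unequal length, padded/truncated to SEQ_LEN
# so every student trajectory has uniform input dimensions.
raw = [np.random.randint(1, 100, size=n) for n in (50, 220, 130)]
x = tf.keras.preprocessing.sequence.pad_sequences(
    raw, maxlen=SEQ_LEN, padding="post", truncating="post")
y = np.array([1, 0, 1])  # pass/fail labels

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=100, output_dim=16),
    tf.keras.layers.LSTM(32),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE),
              loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC(), "accuracy"])
model.fit(x, y, batch_size=BATCH_SIZE, epochs=EPOCHS, verbose=0)
```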
Specifically, hold-out validation is employed in this study because it is well-suited for large datasets, where a single data split can reliably reflect the model’s performance. In this approach, the dataset is divided into two subsets: a training set used to train the model and a test (hold-out) set used to assess its performance on unseen data. The model is trained only once on the training data and then evaluated on the hold-out set, providing an estimate of its generalization ability.

5.3. Evaluation Metrics

In knowledge tracing, many approaches use binary classification to predict students’ academic performance, such as assessing the correctness of exercise completion [1,24].
In this study, we evaluate performance using accuracy, recall, precision, F1-score, and the Area Under the Receiver Operating Characteristic Curve (AUC).
Accuracy, a commonly employed metric in classification tasks, measures the ratio of correctly classified instances, encompassing both true positives and true negatives, relative to the total number of instances evaluated, thereby providing a straightforward assessment of overall predictive performance.
$$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}$$
Precision measures the proportion of true positive predictions out of all positive predictions [13]. Recall measures the proportion of true positive predictions out of all actual positive instances. F1-Score is a metric used to evaluate the performance of a classification model, particularly when dealing with imbalanced classes; it combines both precision and recall into a single measure:
$$\text{Precision} = \frac{TP}{TP + FP}$$
$$\text{Recall} = \frac{TP}{TP + FN}$$
$$\text{F1-Score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
where
  • TP (True Positive): correct predictions of passing courses.
  • TN (True Negative): correct predictions of failing courses.
  • FP (False Positive): incorrect predictions of passing courses.
  • FN (False Negative): incorrect predictions of failing courses.
Finally, AUC serves as a valuable metric for assessing the performance of binary classifiers, particularly in scenarios characterized by class imbalance or variable importance of false positives and false negatives. A higher AUC value signifies superior discrimination capabilities of the model in distinguishing between positive and negative classes [25].
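All five metrics can be computed with scikit-learn; the labels (1 = pass) and scores below are hypothetical, not taken from the PVAMU results.

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

# Hypothetical ground-truth labels and model scores for six students.
y_true  = [1, 1, 0, 1, 0, 1]
y_score = [0.81, 0.64, 0.47, 0.55, 0.58, 0.72]
y_pred  = [int(s >= 0.5) for s in y_score]  # thresholded class predictions

print("ACC:", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall:", recall_score(y_true, y_pred))
print("F1:", f1_score(y_true, y_pred))
print("AUC:", roc_auc_score(y_true, y_score))  # threshold-free ranking metric
```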

6. Results

In this section, we evaluate the proposed method against several baseline models, demonstrating its superior performance across multiple metrics as well as its ability to generate more interpretable results with respect to student ability, prediction probability, and item difficulty compared to DeepIRT [22] on the datasets from PVAMU [2,3].

6.1. Performance Comparison of SLO Tracing Across Different Training and Testing Settings

Table 4 presents a performance comparison between several baseline models (1PL-based DKT, 2PL-based DKT, and DeepIRT) and the proposed method across three different training and testing settings: (1) training on University data and testing on University data, (2) training on COE data and testing on COE data, and (3) training on COAS data and testing on COE data. Evaluation metrics include AUC, ACC, Precision, Recall, and F1-Score. These metrics collectively measure the models’ ability to balance prediction accuracy, sensitivity, and overall robustness.
When trained and tested on University data, the proposed method achieves the highest AUC (63.97) and ACC (80.56), indicating better discriminative ability and classification accuracy compared to the baselines. Importantly, its Recall (0.9336) is the best among all models, suggesting superior capability in correctly identifying positive cases. While Precision and F1-Score are highly competitive, with only marginal differences compared to 1PL-based DKT, the overall balance across metrics makes the proposed method the most reliable performer in this setting.
When trained and tested on COE data, the proposed method again demonstrates robust performance. It records the highest AUC (66.54) and ACC (78.46), surpassing all baselines. Its Recall (0.9265) is the best across models, showing consistency in identifying true positives. Although Precision and F1-Score values are remarkably close to those of the best-performing baseline models, the proposed method maintains a strong overall balance across metrics, confirming its robustness and adaptability to different datasets.
When training on COAS data and testing on COE data, the proposed method performs competitively, achieving the highest Recall (0.8778), which highlights its effectiveness in generalizing to unseen data. Although its AUC (63.14) is slightly lower than some baselines, the strong Recall and balanced F1-Score (0.8563) suggest it is better at capturing critical cases than the baselines.
Taken together, the proposed method consistently provides the best or near-best results across datasets, excels in Recall (critical for real-world applications where missing positive cases is costly), and achieves high overall accuracy. This demonstrates that the proposed approach offers not just incremental gains but meaningful improvements in robustness, generalization, and predictive reliability.
Additionally, Table 5 presents a comparative evaluation of the baseline models and the proposed method across multiple departments (CEE, CHE, CSC, ECE, and MCE) when trained on COE data. Across all test sets, the proposed method consistently achieves competitive or superior performance in terms of AUC, Accuracy (ACC), Precision, Recall, and F1-score, demonstrating strong generalization across departments. For example, on the CEE and CHE testing data, the proposed method achieves both the highest accuracy (71.14% and 84.62%, respectively) and the highest F1-scores (0.8181 and 0.9142), clearly outperforming the baseline DKT and DeepIRT approaches.
A notable strength of the proposed method lies in its ability to achieve high recall without significantly sacrificing precision. For instance, when testing on CHE data, it reaches a recall of 0.9523 while maintaining a precision of 0.8791, striking a balance between identifying correct instances and avoiding false positives. This trend is consistent in other testing cases such as CSC and ECE, where its recall values (0.9731 and 0.9792) surpass those of competing models. Such performance suggests that the proposed method excels at capturing a wide range of correct predictions for SLO tracing, making it more robust for educational data modeling.
When compared to strong baselines like DeepIRT, the proposed method demonstrates clear improvements in F1-scores across most departments. While DeepIRT sometimes shows competitive precision (e.g., 0.8849 in CHE), the proposed method consistently outperforms in overall balanced performance as reflected in the F1-score. For example, in CSC, both DeepIRT and the proposed method achieve similar accuracy levels, but the proposed method achieves a slightly higher F1-score (0.9084 vs. 0.9075). This suggests that while other models may excel in individual metrics, the proposed method achieves the best trade-off across all evaluation measures.
Finally, the results on MCE further reinforce the robustness of the proposed approach. The proposed method achieves performance metrics that surpass the baselines, highlighting its standalone effectiveness without requiring additional complexity. Overall, the results validate the proposed method as a strong, generalizable solution for cross-department SLO tracing, with strengths in recall and balanced performance, making it superior to existing baselines.

6.2. Visualization of the Interpretability of the Proposed Method

Figure 1 shows a comparison of the visualized outputs from DeepIRT and the proposed method across three heatmaps: student ability, predicted probability of SLOs, and item difficulty. Both methods generate meaningful representations of student performance, yet clear differences emerge in the way they capture ability, prediction accuracy, and task difficulty. These differences highlight the strengths of the proposed method in refining interpretability and predictive power.
The student ability heatmaps reveal that the proposed method provides a more differentiated and consistent representation of learner strengths. In cases of correct predictions (solid dots), the proposed method often assigns higher ability estimates compared to DeepIRT. This indicates that the proposed model better recognizes when students demonstrate mastery, thereby offering a more accurate reflection of their learning progress. Such improvements are crucial for adaptive learning systems, where precise ability estimation drives personalized recommendations.
The item difficulty heatmaps further demonstrate the advantage of the proposed method. For correctly answered items, the proposed method tends to assign higher difficulty values than DeepIRT, suggesting that it more effectively separates genuinely challenging items from easy ones. This refined estimation enables a clearer understanding of task characteristics and improves the balance between student ability and item difficulty in the model’s predictions. As a result, the proposed method enhances the interpretability of test items and their alignment with student performance.
Taken together, the improvements in ability estimation and difficulty calibration contribute to the enhanced prediction probabilities shown in the middle heatmaps. The proposed method produces smoother and more consistent probability estimates, which reduces noise and improves interpretability over DeepIRT. These contributions highlight the proposed method’s potential as a more robust and interpretable approach for modeling student learning outcomes.

6.3. Discussion

The results of this study demonstrate that the proposed ensemble IRT-based DKT method significantly improves the interpretability and predictive performance of SLO tracing compared with strong baselines such as DeepIRT. While DeepIRT occasionally achieves competitive precision in certain departments (e.g., 0.8849 in CHE), the proposed method consistently provides a better balance across evaluation metrics, as reflected in the F1-scores. For instance, in the CSC department, both DeepIRT and the proposed method achieve similar accuracy levels, but our method attains a slightly higher F1-score (0.9084 vs. 0.9075), indicating a more robust overall performance. These improvements are largely due to enhanced ability estimation and difficulty calibration, which produce smoother and more consistent predicted probabilities, reducing noise and improving interpretability over conventional DKT approaches.
By integrating multiple IRT-based DKT models through a bagging ensemble, the proposed method not only improves prediction accuracy but also enhances transparency in SLO tracing. Visualizations of student ability, item difficulty, and predicted probabilities allow educators to better understand student learning patterns and identify at-risk students for timely intervention. Compared with previous implementations, such as DeepIRT and the SLO model at PVAMU, the ensemble approach provides a more comprehensive representation of knowledge states and mitigates the limitations of single-model approaches. Evaluations on a diverse PVAMU dataset, spanning multiple colleges and departments, confirm that the proposed method consistently outperforms individual models across AUC, accuracy, precision, and F1-score. Overall, these findings suggest that combining ensemble learning with IRT offers a practical and interpretable framework for monitoring student learning outcomes, supporting evidence-based instructional strategies, and informing decision-making in educational settings.

6.4. Practical Implications

The proposed SLO tracing method presents multiple practical implications for educators and administrators. By providing interpretable insights into student ability, item difficulty, and predicted probabilities, the approach enables instructors to identify students who may be at risk of failing specific courses early in the semester. This allows for timely interventions, such as personalized tutoring, targeted feedback, or curriculum adjustments, ultimately supporting more effective and data-driven instructional strategies.
In addition, the use of ensemble learning combined with IRT enhances the reliability and robustness of SLO predictions. Educational institutions can adopt this approach to monitor overall student learning progress across multiple courses, departments, or programs, helping administrators make informed decisions on resource allocation, curriculum design, and academic support programs. The transparency and interpretability of the visualizations further ensure that stakeholders can trust and act upon the model’s outputs, fostering a more evidence-based approach to improving student outcomes.

6.5. Limitations

While the proposed ensemble IRT-based DKT framework demonstrates improved interpretability and predictive performance for SLO tracing, it still has several limitations. First, the proposed ensemble IRT-based DKT framework was evaluated using data from a single institution, which may limit the generalizability of the findings to other educational contexts or learner populations. The model’s performance and interpretability could vary across institutions with different curricula, assessment structures, or student demographics.
Secondly, the ensemble framework employed a fixed set of IRT variants (e.g., 1PL, 2PL, DeepIRT), which constrains its adaptability to diverse learning conditions. Relying on predetermined models may not fully capture the dynamic nature of student learning behaviors or variations in course difficulty, thereby limiting the model’s flexibility and broader applicability.

7. Conclusions

This quantitative study further enhances interpretability in DKT by integrating bagging with IRT to advance SLO tracing. By combining the predictive strengths of the 1PL, 2PL, and DeepIRT [10,11,22] models, the proposed ensemble framework demonstrates superior performance in forecasting student success, particularly in future courses. Using academic data from PVAMU, an HBCU [2,3], our evaluation confirms that the ensemble consistently outperforms individual models across key metrics including AUC, accuracy, precision, and F1-score. Importantly, incorporating the learned item discrimination parameters from the 2PL model enables a more nuanced understanding of how different items influence student performance, allowing the model to distinguish between easy and difficult content. This level of transparency is especially critical in educational environments such as HBCUs, where equity, insight, and accountability are fundamental to institutional goals.
Future directions for this research on interpretable SLO tracing include two main areas: 1. Model generalization: evaluating and adapting the proposed model across diverse learner populations and institutional contexts to ensure that its interpretability and accuracy extend beyond the current dataset; and 2. Model enhancement: extending the ensemble framework beyond a fixed set of IRT variants (e.g., 1PL, 2PL, DeepIRT) by exploring adaptive ensemble strategies such as dynamic weighting based on student characteristics, content difficulty, or uncertainty estimation through Bayesian deep learning methods.

Author Contributions

Methodology, C.O., L.Q., P.O. and X.D.; Validation, C.O.; Writing – original draft, C.O.; Writing – review & editing, C.O., L.Q., P.O. and X.D.; Visualization, C.O.; Supervision, L.Q. and X.D.; Funding acquisition, X.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research work is supported by NASA under award number 80NSSC22KM0052, by the Army Research Office under cooperative agreement number W911NF-24-2-0133, and by the NSF under award numbers 2235731 and 2428761.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of NASA, NSF, or the Army Research Office or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.

Abbreviations

The following abbreviations are used in this manuscript:
ACC: Accuracy
AUC: Area Under the Curve
CEE: Civil & Environmental Engineering
CHE: Chemical Engineering
COAS: College of Arts & Science
COE: College of Engineering
CSC: Computer Science
DeepIRT: Deep Item Response Theory
DKT: Deep Knowledge Tracing
DKVMN: Dynamic Key-Value Memory Network
ECE: Electrical & Computer Engineering
EDM: Educational Data Mining
FN: False Negative
FP: False Positive
HBCU: Historically Black College and University
IRT: Item Response Theory
ISKT: Learning Intention-Aware Knowledge Tracing for Learning Stage
KCDKT: Knowledge Component-integrated Deep Knowledge Tracing
KCs: Knowledge Components
KQN: Knowledge Query Network
LSTM: Long Short-Term Memory
MCE: Mechanical Engineering
MIRT: Multidimensional Item Response Theory
1PL: One-Parameter Logistic Model
2PL: Two-Parameter Logistic Model
PVAMU: Prairie View A&M University
RCKT: Response Influence-based Counterfactual Knowledge Tracing
SAKT: Self-Attentive Knowledge Tracing
SLO: Student Learning Outcome
STEM: Science, Technology, Engineering, and Mathematics
TN: True Negative
TP: True Positive

References

  1. Song, X.; Li, J.; Cai, T.; Yang, S.; Yang, T.; Liu, C. A survey on deep learning-based knowledge tracing. Knowl. Based Syst. 2022, 258, 110036. [Google Scholar] [CrossRef]
  2. Kuo, M.-M.; Li, X.; Obiomon, P.; Qian, L.; Dong, X. Improving student learning outcome tracing at HBCUs using tabular generative AI and deep knowledge tracing. IEEE Access 2025, 13, 82407. [Google Scholar] [CrossRef]
  3. Kuo, M.M.; Li, X.; Obiomon, P.; Qian, L.; Dong, X. Tracing student learning outcome at Historically Black Colleges and Universities via deep knowledge tracing. IEEE Access 2025, 13, 61340–61349. [Google Scholar] [CrossRef]
  4. Huang, C.-Q.; Huang, Q.-H.; Huang, X.; Wang, H.; Li, M.; Lin, K.-J.; Chang, Y. XKT: Towards explainable knowledge tracing model with cognitive learning theories for questions of multiple knowledge concepts. IEEE Trans. Knowl. Data Eng. 2024, 36, 7308–7325. [Google Scholar] [CrossRef]
  5. Lu, Y.; Wang, D.; Chen, P.; Meng, Q.; Yu, S. Interpreting deep learning models for knowledge tracing. Int. J. Artif. Intell. Educ. 2023, 33, 519–542. [Google Scholar] [CrossRef]
  6. Li, Q.; Yuan, X.; Liu, S.; Gao, L.; Wei, T.; Shen, X.; Sun, J. A genetic causal explainer for deep knowledge tracing. IEEE Trans. Evol. Comput. 2023, 28, 861–875. [Google Scholar] [CrossRef]
  7. Cai, L.; Choi, K.; Hansen, M.; Harrell, L. Item response theory. Annu. Rev. Stat. Its Appl. 2016, 3, 297–321. [Google Scholar] [CrossRef]
  8. Tsutsumi, E.; Nishio, T.; Ueno, M. Deep-IRT with a temporal convolutional network for reflecting students’ long-term history of ability data. In Artificial Intelligence in Education, Proceedings of the International Conference on Artificial Intelligence in Education, Recife, Brazil, 8–12 July 2024; Springer: Cham, Switzerland, 2024; pp. 250–264. [Google Scholar]
  9. Gu, W.; Liu, Z.; Liu, S. Interpretable deep knowledge tracing with graph relationship information. In Proceedings of the 2023 IEEE 5th International Conference on Civil Aviation Safety and Information Technology (ICCASIT), Dali, China, 11–13 October 2023; IEEE: New York City, NY, USA, 2023; pp. 290–295. [Google Scholar]
  10. Piech, C.; Bassen, J.; Huang, J.; Ganguli, S.; Sahami, M.; Guibas, L.J.; Sohl-Dickstein, J. Deep knowledge tracing. In Proceedings of the Advances in Neural Information Processing Systems 28, Montreal, QC, Canada, 7–12 December 2015. [Google Scholar]
  11. Xu, F.; Chen, K.; Zhong, M.; Liu, L.; Liu, H.; Luo, X.; Zheng, L. DKVMN&MRI: A new deep knowledge tracing model based on DKVMN incorporating multi-relational information. PLoS ONE 2024, 19, e0312022. [Google Scholar]
  12. Lu, Y.; Tong, L.; Cheng, Y. Advanced knowledge tracing: Incorporating process data and curricula information via an attention-based framework for accuracy and interpretability. J. Educ. Data Min. 2024, 16, 58–84. [Google Scholar]
  13. Yang, Q.; Chi, J.; Chen, W.; Wu, Z.; Huang, Y.; Zhang, J. Learning intention-aware knowledge tracing for learning stage. Discov. Comput. 2025, 28, 98. [Google Scholar] [CrossRef]
  14. Chen, J.; Liu, Z.; Huang, S.; Liu, Q.; Luo, W. Improving interpretability of deep sequential knowledge tracing models with question-centric cognitive representations. In Proceedings of the 37th AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; pp. 14196–14204. [Google Scholar]
  15. Cui, J.; Yu, M.; Jiang, B.; Zhou, A.; Wang, J.; Zhang, W. Interpretable knowledge tracing via response influence-based counterfactual reasoning. In Proceedings of the 2024 IEEE 40th International Conference on Data Engineering (ICDE), Utrecht, The Netherlands, 16–19 April 2024; IEEE: New York City, NY, USA, 2024; pp. 1103–1116. [Google Scholar]
  16. Hambleton, R.K.; Swaminathan, H.; Rogers, H.J. Fundamentals of Item Response Theory; Sage: Thousand Oaks, CA, USA, 1991; Volume 2. [Google Scholar]
  17. Liu, F.; Bu, C.; Zhang, H.; Wu, L.; Yu, K.; Hu, X. FDKT: Towards an interpretable deep knowledge tracing via fuzzy reasoning. ACM Trans. Inf. Syst. 2024, 42, 139. [Google Scholar] [CrossRef]
  18. Frick, S.; Krivosija, A.; Munteanu, A. Scalable learning of item response theory models. In Proceedings of the International Conference on Artificial Intelligence and Statistics, Valencia, Spain, 2–4 May 2024; PMLR: Cambridge, MA, USA, 2024; pp. 1234–1242. [Google Scholar]
  19. Dong, X.; Yu, Z.; Cao, W.; Shi, Y.; Ma, Q. A survey on ensemble learning. Front. Comput. Sci. 2020, 14, 241–258. [Google Scholar] [CrossRef]
  20. Mienye, D.; Sun, Y. A survey of ensemble learning: Concepts, algorithms, applications, and prospects. IEEE Access 2022, 10, 99129–99149. [Google Scholar] [CrossRef]
  21. Ribeiro, M.H.D.M.; dos Santos Coelho, L. Ensemble approach based on bagging, boosting and stacking for short-term prediction in agribusiness time series. Appl. Soft Comput. 2020, 86, 105837. [Google Scholar] [CrossRef]
  22. Yeung, C.-K. Deep-IRT: Make deep learning-based knowledge tracing explainable using item response theory. arXiv 2019, arXiv:1904.11738. [Google Scholar]
  23. Zhang, J.; Shi, X.; King, I.; Yeung, D.-Y. Dynamic key-value memory networks for knowledge tracing. In Proceedings of the 26th International Conference on World Wide Web, WWW ’17, Perth, Australia, 3–7 April 2017; International World Wide Web Conferences Steering Committee: Geneva, Switzerland, 2017; pp. 765–774. [Google Scholar] [CrossRef]
  24. Wang, J.; Jing, X.; Yan, Z.; Fu, Y.; Pedrycz, W.; Yang, L.T. A survey on trust evaluation based on machine learning. ACM Comput. Surv. 2020, 53, 107. [Google Scholar] [CrossRef]
  25. Gervet, T.; Koedinger, K.; Schneider, J.; Mitchell, T. When is deep learning the best approach to knowledge tracing? J. Educ. Data Min. 2020, 12, 31–54. [Google Scholar]
Figure 1. Comparison of the visualized outputs produced by DeepIRT and the proposed method. The outputs involve three types of heatmaps: student ability, SLOs prediction probability, and item difficulty.
Table 1. Training Dataset Summary.

| Training Dataset | # of Records | # of Students | # of KCs |
|---|---|---|---|
| COE | 36,026 | 2036 | 165 |
| COE + COAS | 101,529 | 6179 | 206 |
| University | 24,964 | 16,549 | 224 |
Table 2. Testing Data Summary.

| Testing Dataset | # of Records | # of Students | # of KCs |
|---|---|---|---|
| CEE | 1043 | 130 | 57 |
| CHE | 1437 | 147 | 57 |
| CSC | 2461 | 284 | 67 |
| ECE | 2173 | 244 | 75 |
| MCE | 3337 | 387 | 83 |
| COE | 10,451 | 1182 | 125 |
| University | 79,305 | 9102 | 200 |
Table 3. Training Configuration Summary.

| Setting | Value |
|---|---|
| Batch Size | 32 |
| Number of Epochs | 50 |
| Learning Rate | 0.003 |
| Sequence Length | 200 |
| Optimizer | Adam |
Table 4. Performance comparison between the baselines and the proposed method across different training and testing settings.

Training: University Data, Testing: University Data

| Model | AUC | ACC | Precision | Recall | F1-Score |
|---|---|---|---|---|---|
| 1PL-based DKT | 62.79 | 80.45 | 0.8476 | 0.9302 | 0.8870 |
| 2PL-based DKT | 62.06 | 79.50 | 0.8459 | 0.9188 | 0.8808 |
| DeepIRT | 62.24 | 79.05 | 0.8462 | 0.9118 | 0.8778 |
| Proposed Method | 63.97 | 80.56 | 0.8465 | 0.9336 | 0.8879 |

Training: COE Data, Testing: COE Data

| Model | AUC | ACC | Precision | Recall | F1-Score |
|---|---|---|---|---|---|
| 1PL-based DKT | 65.70 | 77.75 | 0.8251 | 0.9176 | 0.8689 |
| 2PL-based DKT | 64.17 | 76.50 | 0.8215 | 0.9039 | 0.8607 |
| DeepIRT | 65.48 | 77.00 | 0.8223 | 0.9104 | 0.8641 |
| Proposed Method | 66.54 | 78.46 | 0.8264 | 0.9265 | 0.8736 |

Training: COAS Data, Testing: COE Data

| Model | AUC | ACC | Precision | Recall | F1-Score |
|---|---|---|---|---|---|
| 1PL-based DKT | 62.84 | 76.57 | 0.8382 | 0.8778 | 0.8575 |
| 2PL-based DKT | 62.46 | 75.18 | 0.8393 | 0.8547 | 0.8469 |
| DeepIRT | 61.13 | 74.41 | 0.8349 | 0.8496 | 0.8421 |
| Proposed Method | 63.14 | 76.34 | 0.8359 | 0.8778 | 0.8563 |
Table 5. Performance comparison between the baselines and the proposed method when training on COE data and testing on different department data.

Testing CEE data

| Model | AUC | ACC | Precision | Recall | F1-Score |
|---|---|---|---|---|---|
| 1PL-based DKT | 55.32 | 69.99 | 0.7780 | 0.8406 | 0.8081 |
| 2PL-based DKT | 57.80 | 68.26 | 0.7807 | 0.8036 | 0.7920 |
| DeepIRT | 55.51 | 68.74 | 0.7707 | 0.8316 | 0.8000 |
| Proposed Method | 57.98 | 71.14 | 0.7773 | 0.8635 | 0.8181 |

Testing CHE data

| Model | AUC | ACC | Precision | Recall | F1-Score |
|---|---|---|---|---|---|
| 1PL-based DKT | 56.57 | 79.61 | 0.8770 | 0.8876 | 0.8823 |
| 2PL-based DKT | 54.77 | 71.33 | 0.8747 | 0.7785 | 0.8238 |
| DeepIRT | 56.12 | 83.23 | 0.8849 | 0.9256 | 0.9048 |
| Proposed Method | 58.41 | 84.62 | 0.8791 | 0.9523 | 0.9142 |

Testing CSC data

| Model | AUC | ACC | Precision | Recall | F1-Score |
|---|---|---|---|---|---|
| 1PL-based DKT | 50.21 | 81.39 | 0.8489 | 0.9485 | 0.8960 |
| 2PL-based DKT | 50.25 | 83.38 | 0.8565 | 0.9649 | 0.9075 |
| DeepIRT | 55.71 | 83.26 | 0.8512 | 0.9716 | 0.9075 |
| Proposed Method | 56.14 | 83.42 | 0.8518 | 0.9731 | 0.9084 |

Testing ECE data

| Model | AUC | ACC | Precision | Recall | F1-Score |
|---|---|---|---|---|---|
| 1PL-based DKT | 62.54 | 79.52 | 0.8082 | 0.9746 | 0.8837 |
| 2PL-based DKT | 60.11 | 79.11 | 0.8116 | 0.9614 | 0.8801 |
| DeepIRT | 60.86 | 78.97 | 0.8045 | 0.9729 | 0.8807 |
| Proposed Method | 64.34 | 79.34 | 0.8044 | 0.9792 | 0.8832 |

Testing MCE data

| Model | AUC | ACC | Precision | Recall | F1-Score |
|---|---|---|---|---|---|
| 1PL-based DKT | 61.31 | 71.32 | 0.7969 | 0.8407 | 0.8182 |
| 2PL-based DKT | 61.93 | 71.35 | 0.7909 | 0.8521 | 0.8204 |
| DeepIRT | 62.23 | 71.89 | 0.7957 | 0.8528 | 0.8233 |
| Proposed Method | 63.71 | 73.45 | 0.7974 | 0.8770 | 0.8353 |